
Faculty of Computers and Information

Information System Department

BIGDATA ANALYTICAL DISTRIBUTED SYSTEM

Under Supervision

Prof. Dr. Ali H. El-Bastawissy
Eng. Ali Zidane

Educational Sponsor

Team Members

Name ID Group

Adel Awd El Agawany 20090183 IS-DS

Ayman Mohamed Mahmood 20090067 IS-DS

Hazem Ahmed Talat 20090101 IS-DS

Mohamed Abd el-aal El-Tantawy 20090282 IS-DS

Mohamed Ahmed Saber 20090250 IS-DS

Mohamed Fayez Khater 20090293 IS-DS


Table of Contents

List of Figures
Acknowledgment
Abstract

PART One: Introduction to Big Data Analytics
    Introduction
    Uses for big data
    Big data Technologies
    Big data solutions
    Characteristics

PART Two: Big Data Management Systems
    Hadoop
    What is Hadoop?
    Hadoop Features
    Structure of Hadoop
    File Systems
    Scheduling
    Advantages of Hadoop
    Cassandra
    What is Cassandra?
    Tools for Cassandra
    Cassandra Data model
    Advantages of Cassandra
    Limitations
    DataStax
    DataStax Features

PART Three: Big Data QL (Hive QL vs CQL)
    Hive QL
    What is Hive QL?
    Hive QL Advantages
    Hive QL Disadvantages
    Hive QL Syntax and Features
    Supported Platforms
    Cassandra QL
    What is CQL?
    CQL Advantages
    CQL Disadvantages
    CQL Syntax and Features
    Supported Platforms

Part Four: Working Environment
    Clustering
    Concept of Clustering
    Clustering characteristics
    Clustering attributes
    Clustering benefits
    Clustering management
    Problems solved using clustering
    Setting up single node cluster using Hadoop
    Setting up multi node cluster using Hadoop
    Integrating R with Hadoop
    What is R?
    Statistical features of R
    The R environment
    Basic syntax of the R language
    R Packages
    PostgreSQL
    Connecting R with PostgreSQL using DBI
    HBase
    Connect R with HBase Using RHbase
    RHadoop
    Put It all together
    Connecting R & Hadoop
    R Installation & Configuration
    Implement & Run mapreduce on R
    Big Data Case Study: Social Network #tags
    Developing graphical user interface to install Hadoop

Part Five
    Problems
    What is Next?
    Conclusion
    References

List of Figures

Figure 2.1: Hadoop framework
Figure 2.2: Hadoop multi node cluster
Figure 2.3: Cassandra column family
Figure 2.4: Data stax cluster
Figure 4.1: Computer clustering
Figure 4.2: Name node web interface
Figure 4.3: Job tracker web interface
Figure 4.4: Task tracker web interface
Figure 4.5: Multi node clustering components
Figure 4.6: Multi node clustering layers
Figure 4.7: RHadoop components
Figure 4.8: Thrift dependency
Figure 4.9: Installing Rbase
Figure 4.10: Installing RMR package
Figure 4.11: Installing Rhdfs package
Figure 4.12: Installing Rhbase package
Figure 4.13: Standard R script
Figure 4.14: Converting R script into MapReduce script
Figure 4.15: Results command lines
Figure 4.16: Starting all Hadoop components
Figure 4.17: R studio interface
Figure 4.18: Word count example output
Figure 4.19: MapReduce word count example output
Figure 4.20: Data visualization using Rstudio
Figure 4.21: Graphical user interface window
Figure 4.22: Hadoop running from GUI
Figure 4.23: Running word count example
Figure 4.24: Output file name entering window
Figure 4.25: Job progress
Figure 4.26: Completed MapReduce job
Figure 4.27: Job details
Figure 4.28: MapReduce details
Figure 4.29: Task trackers
Figure 4.30: Slave task tracker
Figure 4.31: Master task tracker
Figure 4.32: Master task tracker completion
Figure 4.33: Master and slave file system
Figure 4.34: Namenode interface
Figure 4.35: Live nodes link
Figure 4.36: Live nodes link output
Figure 4.37: Output file
Figure 4.38: Output file data
Figure 4.39: Ubuntu command lines
Figure 4.40: Stopping Hadoop

Acknowledgment

We would like to express our deepest appreciation to all those who provided us with the possibility to complete this project. Special gratitude goes to our final-year project supervisors, Prof. Dr. Ali H. El-Bastawissy and Eng. Ali Zidane, whose stimulating suggestions and encouragement helped us coordinate our project.

Furthermore, we would like to acknowledge with much appreciation the crucial role of our educational sponsor ‘ ’, which gave us permission to use all the required and necessary materials to complete the project “DataLytics”. Special thanks also go to all our teammates.

Abstract

Big data is a collection of data sets so large and complex that it becomes difficult to process them using on-hand database management tools or traditional data processing applications. The challenges include capture, storage, search, sharing, transfer, analysis, and visualization. The trend toward larger data sets is due to the additional information derivable from analysis of a single large set of related data, which instead requires massively parallel software running on tens, hundreds, or even thousands of clustered servers.

We compared several tools for clustering and for processing this very large volume of data to produce analysis and visualization. We decided to use the Apache Hadoop software for clustering and managing machines, and we then integrated Hadoop with the R programming language to perform our analytics and solve the big data problem.


PART One: Introduction to Big Data Analytics

1.1 Introduction

In a broad range of application areas, data is being collected at unprecedented

scale. Decisions that previously were based on guesswork, or on painstakingly

constructed models of reality, can now be made based on the data itself. Such Big

Data analysis now drives nearly every aspect of our modern society, including mobile

services, retail, manufacturing, financial services, life sciences, and physical sciences.

Big Data has the potential to revolutionize not just research, but also education. A

recent detailed quantitative comparison of different approaches taken by 35 charter

schools in NYC has found that one of the top five policies correlated with measurable

academic effectiveness was the use of data to guide instruction. Imagine a world in

which we have access to a huge database where we collect every detailed measure of

every student's academic performance. This data could be used to design the most

effective approaches to education, starting from reading, writing, and math, to

advanced, college-level, courses. We are far from having access to such data, but

there are powerful trends in this direction. In particular, there is a strong trend for

massive Web deployment of educational activities, and this will generate an increasingly

large amount of detailed data about students' performance.

There have been persuasive cases made for the value of Big Data for urban

planning, intelligent transportation, environmental, energy, smart materials,

computational social sciences, financial systemic risk analysis, homeland security, computer security, and so on.

In 2010, enterprises and users stored more than 13 exabytes of new data; this is

over 50,000 times the data in the Library of Congress. The potential value of global

personal location data is estimated to be $700 billion to end users, and it can result in

an up to 50% decrease in product development and assembly costs, according to a recent McKinsey report. The same report estimates that roughly 140,000 to 190,000 additional workers with deep analytical experience will be needed in the US; furthermore, 1.5 million managers will need to become data-literate. While the potential benefits of

Big Data are real and significant, and some initial successes have already been

achieved, there remain many technical challenges that must be addressed to fully

realize this potential. The sheer size of the data, of course, is a major challenge, and is

the one that is most easily recognized. However, there are others. Industry analysis

companies like to point out that there are challenges not just in Volume, but also in



Variety and Velocity. By Variety, they usually mean heterogeneity of data types,

representation, and semantic interpretation. By Velocity, they mean both the rate at

which data arrive and the time in which it must be acted upon. While these three are

important, this short list fails to include additional important requirements such as

privacy and usability.

The analysis of Big Data involves multiple distinct phases, each of which introduces challenges. Many people unfortunately focus just on

the analysis/modeling phase: while that phase is crucial, it is of little use without the

other phases of the data analysis pipeline. Even in the analysis phase, which has

received much attention, there are poorly understood complexities in the context of

multi-tenanted clusters where several users’ programs run concurrently. Many

significant challenges extend beyond the analysis phase. For example, Big Data has

to be managed in context, which may be noisy, heterogeneous and not include an

upfront model. Doing so raises the need to track provenance and to handle uncertainty

and error: topics that are crucial to success, and yet rarely mentioned in the same

breath as Big Data. Similarly, the questions to the data analysis pipeline will typically

not all be laid out in advance. We may need to figure out good questions based on the

data.

Doing this will require smarter systems and also better support for user

interaction with the analysis pipeline.

Solutions to problems such as this will not come from incremental improvements

to business as usual such as industry may make on its own. Rather, they require us to

fundamentally rethink how we manage data analysis.

Fortunately, existing computational techniques can be applied, either as is or with

some extensions, to at least some aspects of the Big Data problem. For example,

relational databases rely on the notion of logical data independence: users can think

about what they want to compute, while the system (with skilled engineers designing

those systems) determines how to compute it efficiently.

Similarly, the SQL standard and the relational data model provide a uniform,

powerful language to express many query needs and, in principle, allows customers to

choose between vendors, increasing competition.

1.2 Uses of big data

So the real issue is not that you are acquiring large amounts of data (because we

are clearly already in the era of big data). It's what you do with your big data that

matters. The hopeful vision for big data is that organizations will be able to harness

relevant data and use it to make the best decisions.


Technologies today not only support the collection and storage of large amounts of

data, they provide the ability to understand and take advantage of its full value, which

helps organizations run more efficiently and profitably. For instance, with big data

and big data analytics, it is possible to:

Analyze millions of SKUs to determine optimal prices that maximize profit and

clear inventory.

Recalculate entire risk portfolios in minutes and understand future possibilities to

mitigate risk.

Mine customer data for insights that drive new strategies for customer

acquisition, retention, campaign optimization and next best offers.

Quickly identify customers who matter the most.

Generate retail coupons at the point of sale based on the customer's current and

past purchases, ensuring a higher redemption rate.

Send tailored recommendations to mobile devices at just the right time, while

customers are in the right location to take advantage of offers.

Analyze data from social media to detect new market trends and changes in

demand.

Use clickstream analysis and data mining to detect fraudulent behavior.

Determine root causes of failures, issues and defects by investigating user

sessions, network logs and machine sensors.

1.3 Technologies

A number of recent technology advancements are enabling organizations to make the

most of big data and big data analytics:

Cheap, abundant storage and server processing capacity.

Faster processors.

Affordable open source, distributed big data platforms, such as Hadoop.

New storage and processing technologies designed specifically for large

data volumes, including unstructured data.

Parallel processing, clustering, MPP, virtualization, large grid

environments, high connectivity and high throughputs.

Cloud computing and other flexible resource allocation arrangements.

Big data technologies not only support the ability to collect large amounts of data, they

provide the ability to understand it and take advantage of its value. The goal of all

organizations with access to large data collections should be to harness the most

relevant data and use it for optimized decision making.


1.4 Big data solutions

How can you make the most of all that data, now and in the future? It is a twofold

proposition. You can only optimize your success if you weave analytics into your big

data solution. But you also need analytics to help you manage the big data itself.

There are several key technologies that can help you get a handle on your big data, and

more important, extract meaningful value from it.

Information management for big data. Many vendors look at big data as a

discussion related to technologies such as Hadoop, NoSQL, etc. SAS

takes a more comprehensive data management/data governance

approach by providing a strategy and solutions that allow big data to be

managed and used more effectively.

High-performance analytics. By taking advantage of the latest parallel

processing power, high-performance analytics lets you do things you

never thought possible because the data volumes were just too large.

High-performance visual analytics. High-performance visual analytics lets

you explore huge volumes of data in mere seconds so you can quickly

identify opportunities for further analysis. Because it's not just that you

have big data, it's the decisions you make with the data that will create

organizational gains.

Flexible deployment options for big data. Flexible deployment models

bring choice. High-performance analytics from SAS can analyze billions of

variables, and those solutions can be deployed in the cloud (with SAS or

another provider), on a dedicated high-performance analytics appliance or

within your existing IT infrastructure, whichever best suits your

organization's requirements.

1.5 Big data characteristics

There are multiple characteristics of big data, but 3 stand out as defining

Characteristics:

• Huge volume of data (for instance, tools that can manage billions of rows and

billions of columns)

• Complexity of data types and structures, with an increasing volume of unstructured data (80-90% of the data in existence is unstructured), part of the "Digital Shadow" or "Data Exhaust"

• Speed or velocity of new data creation


In addition, the data, due to its size or level of structure, cannot be efficiently

analyzed using only traditional databases or methods.

There are many examples of emerging big data opportunities and solutions. Here

are a few: Netflix suggesting your next movie rental, dynamic monitoring of embedded

sensors in bridges to detect real-time stresses and longer-term erosion, and retailers

analyzing digital video streams to optimize product and display layouts and promotional

spaces on a store-by-store basis are a few real examples of how big data is involved in

our lives today.

These kinds of big data problems require new tools/technologies to store, manage

and realize the business benefit. The new architectures it necessitates are supported by

new tools, processes and procedures that enable organizations to create, manipulate

and manage these very large data sets and the storage environments that house them.


PART Two: Big Data Management Systems

2.1 HADOOP

2.1.1 What is Hadoop?

Hadoop is an open-source software framework that implements the MapReduce programming model for the distributed processing and storage of very large data sets across clusters of machines.

2.1.2 Hadoop history

A little history: Hadoop was born out of a need to process big data, as the amount of generated data continued to increase rapidly. As the Web generated more and more information, it was becoming quite challenging to index the content. Google published its MapReduce paper in 2004, and Hadoop was subsequently created, with major backing from Yahoo!, as an open-source way to implement the MapReduce model.

So we can define Hadoop as follows: Hadoop is a free, Java-based programming framework that supports the processing of large data sets in a distributed computing environment (across clusters of commodity servers). It is part of the Apache project sponsored by the Apache Software Foundation.

Figure 2.1 Hadoop framework



2.1.3 Hadoop features

Hadoop was inspired by Google's MapReduce and Google File System (GFS) papers, in which an application is broken down into numerous small parts. Any of these parts (also called fragments or blocks) can be run on any node in the cluster.

It is designed to scale up from a single server to thousands of machines, with a very high degree of fault tolerance. Rather than relying on high-end hardware, the resiliency of these clusters comes from the software's ability to detect and handle failures at the application layer.

Hadoop makes it possible to run applications on systems with thousands of nodes involving thousands of terabytes. Its distributed file system facilitates rapid data transfer rates among nodes and allows the system to continue operating uninterrupted in case of a node failure. This approach lowers the risk of catastrophic system failure, even if a significant number of nodes become inoperative.

Hadoop implements a computational paradigm named MapReduce[1], where the application is divided into many small fragments of work, each of which may be executed or re-executed on any node in the cluster (a minimal word-count sketch of this model follows this list).

It provides a distributed file system that stores data on the compute nodes, providing very high aggregate bandwidth across the cluster.

Both MapReduce and the distributed file system are designed so that node failures are automatically handled by the framework.
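To make the MapReduce model concrete, here is a minimal sketch of the canonical word-count job written against the org.apache.hadoop.mapreduce API (the "new" API available from Hadoop 0.20 onwards). It is an illustration, not part of this project's code; the HDFS input and output paths are supplied on the command line.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: emit (word, 1) for every token in the input split.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce phase: sum the counts emitted for each word.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "word count");      // job name is arbitrary
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);  // local pre-aggregation on each mapper
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. an HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // must not already exist
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

The framework splits the input across the cluster, runs the map tasks as close to the data as possible, shuffles the intermediate (word, count) pairs by key, and runs the reduce tasks to produce the final totals.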

2.1.4 Structure of the Apache Hadoop platform (Hadoop ecosystem)

Hadoop consists of the Hadoop Common which provides access to the file

systems supported by Hadoop. The Hadoop Common package contains the necessary

JAR files and scripts needed to start Hadoop. The package also provides source code,

documentation and a contribution section that includes projects from the Hadoop

Community.

For effective scheduling of work, every Hadoop-compatible file system should

provide location awareness: the name of the rack (more precisely, of the network

switch) where a worker node is. Hadoop applications can use this information to run

work on the node where the data is, and, failing that, on the same rack/switch, reducing

backbone traffic. The Hadoop Distributed File System (HDFS) uses this method when

replicating data to try to keep different copies of the data on different racks. The goal is


to reduce the impact of a rack power outage or switch failure, so that even if these

events occur, the data may still be readable.

A small Hadoop cluster will include a single master and multiple worker nodes.

The master node consists of a JobTracker, TaskTracker, NameNode and DataNode. A

slave or worker node acts as both a DataNode and TaskTracker, though it is possible to

have data-only worker nodes, and compute-only worker nodes. These are normally

used only in nonstandard applications. Hadoop requires JRE 1.6 or higher. The standard

start-up and shutdown scripts require ssh to be set up between nodes in the cluster.

In a larger cluster, the HDFS is managed through a dedicated NameNode server

to host the file system index, and a secondary NameNode that can generate snapshots

of the namenode's memory structures, thus preventing file-system corruption and

reducing loss of data. Similarly, a standalone JobTracker server can manage job

scheduling. In clusters where the Hadoop MapReduce engine is deployed against an

alternate file system, the NameNode, secondary NameNode and DataNode architecture

of HDFS is replaced by the file-system-specific equivalent.

Figure 2.2: Hadoop multi node cluster


2.1.5 Hadoop Distributed File System

HDFS is a distributed, scalable, and portable file system written in Java for the

Hadoop framework. Each node in a Hadoop instance typically has a single namenode;

a cluster of datanodes form the HDFS cluster. The situation is typical because each

node does not require a datanode to be present. Each datanode serves up blocks of

data over the network using a block protocol specific to HDFS. The file system uses the

TCP/IP layer for communication. Clients use RPC to communicate between each other.

HDFS stores large files (an ideal file size is a multiple of 64 MB), across multiple

machines. It achieves reliability by replicating the data across multiple hosts, and hence

does not require RAID storage on hosts. With the default replication value, 3, data is

stored on three nodes: two on the same rack, and one on a different rack. Data nodes

can talk to each other to rebalance data, to move copies around, and to keep the

replication of data high. HDFS is not fully POSIX compliant, because the requirements

for a POSIX file system differ from the target goals for a Hadoop application. The

tradeoff of not having a fully POSIX-compliant file system is increased performance for

data throughput. HDFS was designed to handle very large files.

HDFS has recently added high-availability capabilities, allowing the main

metadata server (the namenode) to be failed over manually to a backup in the event of

failure. Automatic fail-over is being developed as well. Additionally, the file system

includes what is called a secondary namenode, which misleads some people into

thinking that when the primary namenode goes offline, the secondary namenode takes

over. In fact, the secondary namenode regularly connects with the primary namenode

and builds snapshots of the primary namenode's directory information, which is then

saved to local or remote directories. These checkpointed images can be used to restart

a failed primary namenode without having to replay the entire journal of file-system

actions, then to edit the log to create an up-to-date directory structure. Because the

namenode is the single point for storage and management of metadata, it can be a

bottleneck for supporting a huge number of files, especially a large number of small

files. HDFS Federation is a new addition that aims to tackle this problem to a certain

extent by allowing multiple name spaces served by separate namenodes.

An advantage of using HDFS is data awareness between the job tracker and

task tracker. The job tracker schedules map or reduce jobs to task trackers with an

awareness of the data location. An example of this would be if node A contained data

(x,y,z) and node B contained data (a,b,c). Then the job tracker will schedule node B to

perform map or reduce tasks on (a,b,c) and node A would be scheduled to perform map

or reduce tasks on (x,y,z). This reduces the amount of traffic that goes over the network

and prevents unnecessary data transfer. When Hadoop is used with other file systems

this advantage is not always available. This can have a significant impact on job-

completion times, which has been demonstrated when running data-intensive jobs.


Another limitation of HDFS is that it cannot be mounted directly by an existing

operating system. Getting data into and out of the HDFS file system, an action that often

needs to be performed before and after executing a job, can be inconvenient. A

Filesystem in Userspace (FUSE) virtual file system has been developed to address this

problem, at least for Linux and some other Unix systems.

File access can be achieved through the native Java API, the Thrift API to

generate a client in the language of the users' choosing (C++, Java, Python, PHP,

Ruby, Erlang, Perl, Haskell, C#, Cocoa, Smalltalk, and OCaml), the command-line

interface, or browsed through the HDFS-UI webapp over HTTP.
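As a brief, hedged illustration of the native Java API mentioned above, the sketch below writes a small file into HDFS and reads it back. The file path and the replication setting are arbitrary examples; the client picks up the cluster address (fs.default.name on Hadoop 1.x) from the configuration files on the classpath.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();          // reads core-site.xml / hdfs-site.xml
        conf.set("dfs.replication", "3");                  // optional per-client replication factor
        FileSystem fs = FileSystem.get(conf);              // the HDFS client

        Path path = new Path("/user/hduser/example.txt");  // hypothetical HDFS path

        // Write a small file, overwriting it if it already exists.
        FSDataOutputStream out = fs.create(path, true);
        out.writeUTF("hello HDFS");
        out.close();

        // Read the file back.
        FSDataInputStream in = fs.open(path);
        System.out.println(in.readUTF());
        in.close();

        fs.close();
    }
}

Equivalent operations are also available from the command-line interface (for example, hadoop fs -put and hadoop fs -cat).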

2.1.6 Other Filesystems

By May 2011, the list of supported filesystems included:

HDFS: Hadoop's own rack-aware filesystem. This is designed to scale to

tens of petabytes of storage and runs on top of the filesystems of the

underlying operating systems.

Amazon S3 filesystem. This is targeted at clusters hosted on the Amazon

Elastic Compute Cloud server-on-demand infrastructure. There is no rack-

awareness in this filesystem, as it is all remote.

CloudStore (previously Kosmos Distributed File System), which is rack-

aware.

FTP Filesystem: this stores all its data on remotely accessible FTP

servers.

Read-only HTTP and HTTPS file systems.

Hadoop can work directly with any distributed file system that can be mounted by the underlying operating system.

JobTracker and TaskTracker: the MapReduce engine

Above the file systems comes the MapReduce engine, which consists of one

JobTracker, to which client applications submit MapReduce jobs. The JobTracker

pushes work out to available TaskTracker nodes in the cluster, striving to keep the work

as close to the data as possible. With a rack-aware filesystem, the JobTracker knows

which node contains the data, and which other machines are nearby. If the work cannot

be hosted on the actual node where the data resides, priority is given to nodes in the

same rack. This reduces network traffic on the main backbone network. If a

TaskTracker fails or times out, that part of the job is rescheduled. The TaskTracker on

each node spawns off a separate Java Virtual Machine process to prevent the

TaskTracker itself from failing if the running job crashes the JVM. A heartbeat is sent

from the TaskTracker to the JobTracker every few minutes to check its status. The Job


Tracker and TaskTracker status and information is exposed by Jetty and can be viewed

from a web browser.

If the JobTracker failed on Hadoop 0.20 or earlier, all ongoing work was lost.

Hadoop version 0.21 added some checkpointing to this process; the JobTracker records

what it is up to in the filesystem. When a JobTracker starts up, it looks for any such

data, so that it can restart work from where it left off. In earlier versions of Hadoop, all

active work was lost when a JobTracker restarted.

2.1.7 Known limitations of this approach are:

The allocation of work to TaskTrackers is very simple. Every TaskTracker has a

number of available slots (such as "4 slots"). Every active map or reduce task

takes up one slot. The Job Tracker allocates work to the tracker nearest to the

data with an available slot. There is no consideration of the current system load

of the allocated machine, and hence its actual availability.

If one TaskTracker is very slow, it can delay the entire MapReduce job -

especially towards the end of a job, where everything can end up waiting for the

slowest task. With speculative execution enabled, however, a single task can be

executed on multiple slave nodes.

2.1.8 Scheduling

By default Hadoop uses FIFO scheduling, with five optional scheduling priorities, to schedule jobs from a work queue. In version 0.19 the job scheduler was refactored out of the JobTracker, and the ability to use an alternate scheduler (such as the Fair scheduler or the Capacity scheduler) was added.

Fair scheduler

The fair scheduler was developed by Facebook. The goal of the fair scheduler is to

provide fast response times for small jobs and QoS for production jobs. The fair

scheduler has three basic concepts.

Jobs are grouped into Pools.

Each pool is assigned a guaranteed minimum share.

Excess capacity is split between jobs.

By default, jobs that are uncategorized go into a default pool. Pools have to specify the

minimum number of map slots, reduce slots, and a limit on the number of running jobs.


Capacity scheduler

The capacity scheduler was developed by Yahoo. The capacity scheduler supports

several features which are similar to the fair scheduler.[23]

Jobs are submitted into queues.

Queues are allocated a fraction of the total resource capacity.

Free resources are allocated to queues beyond their total capacity.

Within a queue a job with a high level of priority will have access to the queue's

resources.

There is no preemption once a job is running.
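As a small illustration of how a client job targets these schedulers, the sketch below tags a job with a Fair scheduler pool and a Capacity scheduler queue before submission. The property names (mapred.fairscheduler.pool and mapred.job.queue.name) are the ones used by the Hadoop 0.20/1.x schedulers; the pool and queue names shown are hypothetical and must exist in the scheduler configuration, and only the property matching the scheduler actually deployed on the cluster takes effect.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ScheduledJobSubmission {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Fair scheduler: place the job in a named pool (hypothetical pool name).
        conf.set("mapred.fairscheduler.pool", "analytics");

        // Capacity scheduler: submit the job to a named queue (hypothetical queue name).
        conf.set("mapred.job.queue.name", "research");

        Job job = new Job(conf, "scheduled example");
        // Mapper, reducer, and output types would normally be set here,
        // exactly as in the word-count example shown earlier.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}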

2.1.9 Other applications

The HDFS file system is not restricted to MapReduce jobs. It can be used for

other applications, many of which are under development at Apache. The list includes

the HBase database, the Apache Mahout machine learning system, and the Apache

Hive Data Warehouse system. Hadoop can in theory be used for any sort of work that is

batch-oriented rather than real-time, that is very data-intensive, and able to work on

pieces of the data in parallel. As of October 2009, commercial applications of

Hadoop[24] included:

Log and/or clickstream analysis of various kinds

Marketing analytics

Machine learning and/or sophisticated data mining

Image processing

Processing of XML messages

Web crawling and/or text processing

General archiving, including of relational/tabular data, e.g. for compliance


2.1.10 Advantages of Hadoop:

enables the distributed processing of large data sets across clusters of

commodity servers

It is designed to scale up from a single server to thousands of machines, with a

very high degree of fault tolerance

Rather than relying on high-end hardware, the resiliency of these clusters comes from the

software’s ability to detect and handle failures at the application layer

Hadoop's popularity is partly due to the fact that it is used by some of the world's

largest Internet businesses to analyze unstructured data. Hadoop enables

distributed applications to handle data volumes in the order of thousands of

exabytes

analyzing large data sets (search algorithms, market risk analysis, data mining

on online retail data, and analytics on user behavior data)

parallel data processing

Hadoop's scalability makes it attractive to businesses because of the

exponentially increasing nature of the data they handle. Another core strength of

Hadoop is that it can handle structured as well as unstructured data, from a

variable number of sources.

Hadoop has been demonstrated on clusters of up to 4000 nodes. Sort

performance on 900 nodes is good (sorting 9TB of data on 900 nodes takes

around 1.8 hours)

Hadoop enables a computing solution that is:

Scalable– New nodes can be added as needed, and added without needing

to change data formats, how data is loaded, how jobs are written, or the

applications on top.

Cost effective– Hadoop brings massively parallel computing to commodity

servers. The result is a sizeable decrease in the cost per terabyte of storage,

which in turn makes it affordable to model all your data.

Flexible– Hadoop is schema-less, and can absorb any type of data,

structured or not, from any number of sources. Data from multiple sources

can be joined and aggregated in arbitrary ways enabling deeper analyses

than any one system can provide.

Fault tolerant– When you lose a node, the system redirects work to another

location of the data and continues processing without missing a beat.


2.2 Cassandra

Apache Cassandra is an open source database management system designed to handle very large amounts of data spread across many servers while providing a highly available service with no single point of failure.

It is a NoSQL solution that was initially developed by Facebook.

Cassandra provides a structured key-value store in which data is grouped into column families. The set of column families is fixed when the database is created, but any number of columns can be added to a column family at any time.

Cassandra is supported by commercial offerings such as DataStax and Acunu Analytics.

Tools for Cassandra:

Data browsers: Chiton, Cassandra GUI, and Toad for Cloud Databases.

Administration tools: OpsCenter, to manage Cassandra clusters, and Cassandra Cluster Admin, to help people administer their Cassandra clusters.

Cassandra data model (Column family):

Figure 2.3 Cassandra column family

Each row is identified by a unique row key, and each column within a row is identified by a column name and contains a value.
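A useful mental model, offered here purely as an illustration and not as the Cassandra API, is that a column family behaves like a two-level sorted map: a row key maps to a sorted map of column names to values, and different rows may carry different columns. A minimal Java sketch:

import java.util.SortedMap;
import java.util.TreeMap;

public class ColumnFamilyModel {
    public static void main(String[] args) {
        // row key -> (column name -> column value)
        SortedMap<String, SortedMap<String, String>> users =
                new TreeMap<String, SortedMap<String, String>>();

        SortedMap<String, String> row1 = new TreeMap<String, String>();
        row1.put("name", "Alice");                // hypothetical example data
        row1.put("email", "alice@example.com");
        users.put("user:1001", row1);             // "user:1001" is the row key

        // A second row in the same column family may have a different set of columns.
        SortedMap<String, String> row2 = new TreeMap<String, String>();
        row2.put("name", "Bob");
        row2.put("country", "Egypt");
        users.put("user:1002", row2);

        System.out.println(users.get("user:1001").get("email")); // read one column of one row
    }
}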

2.2.1 Advantages of Apache Cassandra:

Decentralized: every node in the cluster has the same role and data is

distributed across the cluster, every node contains different data.

Supports replication and multi data center replication: Replication

strategies are configurable. Cassandra is designed as a distributed

system, for deployment of large numbers of nodes across multiple data

centers. Key features of Cassandra’s distributed architecture are


specifically tailored for multiple-data center deployment, for redundancy,

for failover and disaster recovery.

Scalability: new machines can be added with no down time or interruption

to application.

Fault tolerance: data is automatically replicated to multiple nodes for fault

tolerance replication across multiple data centers, failed nodes can be

replaced with no down time.

Tunable consistency: reads and writes offer a tunable level of consistency, all the way from "writes never fail" to "block for all replicas to be readable".

MapReduce support: Cassandra has Hadoop integration with MapReduce support; there is also support for Apache Pig and Apache Hive.

Query language: CQL. Language drivers are available for Java (JDBC), Python (DBAPI2), and Node.js.

Cassandra also has many high-level client libraries for Python, Java, .NET, Ruby, PHP, Perl, and C++.
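For example, assuming the separately distributed Cassandra JDBC driver is on the classpath, a Java client could issue CQL through the standard java.sql API roughly as sketched below. The driver class name, connection URL format, keyspace, and table used here are assumptions for illustration only and are not taken from this project.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class CqlJdbcExample {
    public static void main(String[] args) throws Exception {
        // Assumed driver class and URL format of the cassandra-jdbc project (Thrift port 9160).
        Class.forName("org.apache.cassandra.cql.jdbc.CassandraDriver");
        Connection conn =
                DriverManager.getConnection("jdbc:cassandra://localhost:9160/demo_keyspace");

        Statement stmt = conn.createStatement();
        // Hypothetical table and columns, used only to show a CQL round trip.
        ResultSet rs = stmt.executeQuery("SELECT name, email FROM users WHERE user_id = 1001");
        while (rs.next()) {
            System.out.println(rs.getString("name") + " " + rs.getString("email"));
        }

        rs.close();
        stmt.close();
        conn.close();
    }
}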

Limitations:

Single column value may not be larger than 2GB.

Maximum number of columns per row is 2 billion columns.

The key and column names must be under 64 KB.

2.2.2 DataStax

About DataStax Enterprise:

DataStax Enterprise is a big data platform built on Apache Cassandra that

manages real-time, analytics, and enterprise search data. DataStax Enterprise

leverages Cassandra, Apache Hadoop, and Apache Solr to shift your focus from

the data infrastructure to using your data strategically.

New Features in DataStax Enterprise 2.2

DataStax Enterprise 2.2 introduces these features:

Updates Cassandra 1.0 to Cassandra 1.1.5 - In Cassandra 1.1, key

improvements have been made in the areas of CQL, performance, and

management ease of use.

Support for Installation on the HP Cloud - In addition to Amazon Elastic

Compute Cloud, DataStax now supports installation of DataStax Enterprise in

the HP Cloud environment. You can install DataStax on Ubuntu 11.04 Natty

Narwhal and Ubuntu 11.10 Oneiric Ocelot.

Support for SUSE Enterprise Linux - DataStax Enterprise adds SUSE

Enterprise Linux 11.2 and 11.4 to its list of supported platforms.


Improved Solr Shard Selection algorithm - Previously, for each queried token

range, Cassandra selected the first closest node to the node issuing the

query within that range. Equally distant nodes were always tried in the same

order, so that resulted in one or more nodes being hotspotted and often

selecting more shards than actually needed. The improved algorithm uses a

shuffling technique to balance the load, and also attempts to minimize the

number of shards queried as well as the amount of data transferred from non-

local nodes.

Capability to Set Solr Column Expiration - You can update a DSE Search

column to set a column expiration date using CQL, which eventually causes

removal of the column from the database.

2.2.3 Key Features of DataStax Enterprise

The key features of DataStax Enterprise include:

Production Certified Cassandra – DataStax Enterprise contains a fully tested,

benchmarked, and certified version of Apache Cassandra that is suitable for

mission-critical production deployments.

No Single Point of Failure - In the Hadoop Distributed File System (HDFS)

master/slave architecture, the NameNode entry point into the cluster stores

configuration metadata about the cluster. If the NameNode fails, the Hadoop

system goes down. DataStax Enterprise improves upon this architecture by

making nodes peers. Being peers, any node in the cluster can load data files,

and any analytics node can assume the responsibilities of job tracker for

MapReduce jobs.

Reserve Job Tracker - DataStax Enterprise keeps a job tracker in reserve to

take over in the event of a problem that would affect availability.

Multiple Job Trackers - In the Cassandra File System (CassandraFS), you

can run one or more job tracker services across multiple data centers and

create multiple CassandraFS keyspaces per data center. Using this capability

has performance, data replication, and other benefits.

Hadoop MapReduce using Multiple Cassandra File Systems - CassandraFS

is an HDFS-compatible storage layer. DataStax replaces HDFS with

CassandraFS to run MapReduce jobs on Cassandra's peer-to-peer, fault-

tolerant, and scalable architecture. In DataStax Enterprise 2.1 and later, you

can create additional CassandraFS file systems to organize and optimize Hadoop data.

Analytics Without ETL - Using DataStax Enterprise, you run MapReduce jobs

directly against your data in Cassandra. You can even perform real-time and

analytics workloads at the same time without one workload affecting the

performance of the other. Starting some cluster nodes as Hadoop analytics


nodes and others as pure Cassandra real-time nodes automatically replicates

data between nodes.

Elastic Workload Re-provisioning - Existing nodes can be re-provisioned to

assume a different workload. For example, you can change two real-

time/transactional nodes to analytics nodes during off-peak hours and then

return them to the original configuration after the analytics tasks have

completed.

Streamlined Setup and Operations - In Hadoop, you have to set up different

mode configurations: stand-alone mode or pseudo-distributed mode for a

single node setup, or cluster mode for a multi-node configuration. In DataStax

Enterprise, you configure only one mode (cluster mode) regardless of the

number of nodes.

Hive Support - Hive, a data warehouse system, facilitates data

summarization, ad-hoc queries, and the analysis of large datasets stored in

Hadoop-compatible file systems. Any JDBC compliant user interface

connects to Hive from the server. Using the Cassandra-enabled Hive

MapReduce client in DataStax Enterprise, you project a relational structure

onto Hadoop data in the Cassandra file systems, and query the data using a

SQL-like language. Cassandra nodes share the Hive metastore automatically,

eliminating repetitive HIVE configuration steps.

Pig Support - The Cassandra-enabled Pig MapReduce client included with

DataStax Enterprise is a high-level platform for creating MapReduce

programs used with Hadoop. You can analyze large data sets, running jobs in

MapReduce mode and Pig programs directly on data stored in Cassandra.

Enterprise Search Capabilities - DataStax Enterprise Search fully integrates

Apache Solr for ad-hoc querying of data, full-text search, hit highlighting,

multiple search attributes, geo-spatial search, and for searching rich

documents, such as PDF and Microsoft Word, and more.

Migration of RDBMS data - Apache Sqoop in DataStax Enterprise provides

easy migration of RDBMS data, such as Oracle, Microsoft SQL Server,

MySQL, Sybase, and DB2 RDBMS, and non-relational data sources, such as

NoSQL into the DataStax Enterprise server.

Runtime Logging - DataStax Enterprise transfers log-based data directly into

the server using log4j. Apache log4j is a Java-based logging framework that

provides runtime application feedback and control over the size of log

statements. Cassandra Appender can store the log4j messages in the

Cassandra table-like structure for in-depth analysis using the Hadoop and

Solr capabilities.

Support for Mahout - The Hadoop component, Apache Mahout, incorporated

into DataStax Enterprise 2.1 and later offers machine learning libraries.


Machine learning improves a system, such as the one that recreates the

Google priority inbox, based on past experience or examples.

Full Integration with DataStax OpsCenter - Using DataStax OpsCenter, you

can monitor, administer, and configure one or more DataStax Enterprise

clusters in an easy-to-use graphical interface. Schedule automatic backups,

explore Cassandra data, and see detailed health and status information about

clusters, such as the up or down status of nodes, graphs of performance

metrics, storage limitations, and progress of Hadoop MapReduce jobs.

Figure 2.4 Data stax clustering

Deployment


Production Deployment Planning

This section provides guidelines for determining the size of your production

Cassandra cluster based on the data you plan to store.

Planning includes the following activities:

Prerequisites

Before starting to plan a production cluster, you need:

A good understanding of the size of the raw data you plan to store.

A good estimate of your typical application workload.

A plan to model your data in Cassandra (number of column families, rows,

columns per row, and so on).

Selecting Hardware for Enterprise Implementations


As with any application, choosing appropriate hardware depends on selecting the

right balance of the following resources: memory, CPU, disks, number of nodes,

and network.

Note

Hadoop and Solr nodes require their own nodes/disks and have specific

hardware requirements. See the Hadoop and Solr documentation for more

information when determining your capacity requirements.

Memory

The more memory a Cassandra node has, the better read performance. More

RAM allows for a larger file system cache and reduces disk I/O for reads. The

ideal amount of RAM depends on the anticipated size of your hot data.

DataStax recommends the following memory requirements:

For dedicated hardware, a minimum of 8GB of RAM is needed. For most

implementations you should use 16GB to 32GB.

Java heap space should be set according to Cassandra 1.1 guidelines.

For a virtual environment use a minimum of 4GB, such as Amazon EC2

Large instances. For production clusters with a healthy amount of traffic, 8GB

is more common.

For Solr and Hadoop nodes, use 32GB or more of total RAM.

CPU

Insert-heavy workloads are CPU-bound in Cassandra before becoming memory-

bound. Cassandra is highly concurrent and uses as many CPU cores as

available.

For dedicated hardware, 8-core processors are the current price-performance

sweet spot.

For virtual environments, consider using a provider that allows CPU bursting,

such as Rackspace Cloud Servers.

Disk

What you need for your environment depends a lot on the usage, so it's

important to understand the mechanism. Cassandra writes data to disk for two

purposes:

All data is appended to the commit log for durability.

When thresholds are reached, Cassandra periodically flushes in-memory data

structures (memtables) to immutable SSTable data files for storage of column

family data.

Commit logs receive every write made to a Cassandra node, but are only read

during node start up. Commit logs are purged after the corresponding data is

flushed. Conversely, SSTable (data file) writes occur asynchronously and are

read during client look-ups. Additionally, SSTables are periodically compacted.


Compaction improves performance by merging and rewriting data and discarding

old data. However, during compaction (or node repair), disk utilization and data

directory volume can substantially increase. For this reason, DataStax

recommends leaving an adequate amount of free disk space available on a node

(50% [worst case] for tiered compaction, 10% for leveled compaction).

2.2.4 Recommendations:

DataStax neither supports nor recommends using Network Attached Storage

(NAS) because of performance issues, such as network saturation, I/O

overload, pending-task swamp, excessive memory usage, and disk

contention.

When choosing disks, consider both capacity (how much data you plan to

store) and I/O (the write/read throughput rate). Most workloads are best

served by using less expensive SATA disks and scaling disk capacity and I/O

by adding more nodes (with more RAM).

Solid-state drives (SSDs) are a valid choice for Cassandra. Cassandra's

sequential, streaming write patterns minimize the undesirable effects of write

amplification associated with SSDs.

Ideally Cassandra needs at least two disks, one for the commit log and the

other for the data directories. At a minimum the commit log should be on its

own partition.

Commit log disk - this disk does not need to be large, but it should be fast

enough to receive all of your writes as appends (sequential I/O).

Data disks - use one or more disks and make sure they are large enough for

the data volume and fast enough to both satisfy reads that are not cached in

memory and to keep up with compaction.

RAID - the compaction process can temporarily require up to double the

normal data directory volume. This means when approaching 50% of disk

capacity, you should use RAID 0 or RAID 10 for your data directory volumes.

RAID also helps smooth out I/O hotspots within a single SSTable.

Use RAID0 if disk capacity is a bottleneck and rely on Cassandra's

replication capabilities for disk failure tolerance. If you lose a disk on a

node, you can recover lost data through Cassandra's built-in repair.

Use a blockdev --setra (readahead) setting of 512, especially on Amazon EC2 RAID0 devices.

See Optimum blockdev --setra Settings for RAID.

Use RAID10 to avoid large repair operations after a single disk failure,

or if you have disk capacity to spare.

Because data is stored in the memtable, generally RAID is not needed

for the commit log disk, but if you need the extra redundancy, use

RAID 1.


Extended file systems - On ext2 or ext3, the maximum file size is 2TB

even using a 64-bit kernel. On ext4 it is 16TB.

Because Cassandra can use almost half your disk space for a single file, use

XFS when raiding large disks together, particularly if using a 32-bit kernel. XFS

file size limits are 16TB max on a 32-bit kernel, and essentially unlimited on 64-

bit.

Number of Nodes

The amount of data on each disk in the array isn't as important as the total size

per node. Using a greater number of smaller nodes is better than using fewer

larger nodes because of potential bottlenecks on larger nodes during

compaction.

Network

Since Cassandra is a distributed data store, it puts load on the network to handle

read/write requests and replication of data across nodes. Be sure to choose

reliable, redundant network interfaces and make sure that your network can

handle traffic between nodes without bottlenecks.

Recommended bandwidth is 1000 Mbit/s (Gigabit) or greater.

Bind the Thrift interface (listen_address) to a specific NIC (Network

Interface Card).

Bind the RPC server interface (rpc_address) to another NIC.

Cassandra is efficient at routing requests to replicas that are geographically

closest to the coordinator node handling the request. Cassandra will pick a

replica in the same rack if possible, and will choose replicas located in the same

data center over replicas in a remote data center.

Ports

If using a firewall, make sure that nodes within a cluster can reach each other.

See Configuring Firewall Port Access.

Planning an Amazon EC2 Cluster

DataStax provides an Amazon Machine Image (AMI) to allow you to quickly

deploy a multi-node Cassandra cluster on Amazon EC2. The DataStax AMI

initializes all nodes in one availability zone using the SimpleSnitch. If you want an

EC2 cluster that spans multiple regions and availability zones, do not use the

DataStax AMI. Instead, initialize your EC2 instances for each Cassandra node

and then configure the cluster as a multiple data-center cluster.

Use the following guidelines when setting up your cluster:

For most production clusters, use Extra Large instances with local

storage.


Amazon Web Service recently reduced the number of default ephemeral disks

attached to the image from four to two. Performance will be slower for new nodes

unless you manually attach the additional two disks; see Amazon EC2 Instance

Store.

For low to medium data throughput production clusters, use Large

instances with local storage (which are generally adequate for about a

year).

RAID0 the ephemeral disks, and put both the data directory and the

commit log on that volume. This has proved to be better in practice than

putting the commit log on the root volume (which is also a shared

resource).

For data redundancy, consider deploying your cluster across multiple

availability zones or using EBS volumes to store your backup files.

EBS volumes are not recommended for Cassandra data volumes. Their

network performance and disk I/O are not good fits for Cassandra for the

following reasons:

EBS volumes contend directly for network throughput with standard

packets. This means that EBS throughput is likely to fail if you saturate

a network link.

EBS volumes have unreliable performance. I/O performance can be

exceptionally slow, causing the system to backload reads and writes

until the entire cluster becomes unresponsive.

Adding capacity by increasing the number of EBS volumes per host

does not scale. You can easily surpass the ability of the system to

keep effective buffer caches and concurrently serve requests for all of

the data it is responsible for managing.

Calculating Usable Disk Capacity

To calculate how much data your Cassandra nodes can hold, calculate the

usable disk capacity per node and then multiply that by the number of nodes in

your cluster. Remember that in a production cluster, you will typically have your

commit log and data directories on different disks. This calculation is for

estimating the usable capacity of the data volume.

Start with the raw capacity of the physical disks:

raw_capacity = disk_size * number_of_disks

Account for file system formatting overhead (roughly 10 percent) and the RAID

level you are using. For example, if using RAID-10, the calculation would be:

(raw_capacity * 0.9) / 2 = formatted_disk_space

During normal operations, Cassandra routinely requires disk capacity for

compaction and repair operations. For optimal performance and cluster health,


DataStax recommends that you do not fill your disks to capacity, but run at 50-80

percent capacity. With this in mind, calculate the usable disk space as follows

(example below uses 50%):

formatted_disk_space * 0.5 = usable_disk_space
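As a hypothetical worked example, a node with two 2 TB data disks configured as RAID 0 would give roughly:

raw_capacity = 2000 GB * 2 = 4000 GB

formatted_disk_space = 4000 GB * 0.9 = 3600 GB (RAID 0, so no mirroring divisor)

usable_disk_space = 3600 GB * 0.5 = 1800 GB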

Calculating User Data Size

Typically in data storage systems, the size of your raw data will be larger once it

is loaded into the database due to storage overhead. On average, raw data is

about 2 times larger on disk after it is loaded into Cassandra, but could vary

in either direction depending on the characteristics of your data and column

families. The calculations in this section account for data persisted to disk, not for

data stored in memory.

Column Overhead - Every column in Cassandra incurs 15 bytes of

overhead. Since each row in a column family can have different column

names as well as differing numbers of columns, metadata is stored

for each column. For counter columns and expiring columns, add an

additional 8 bytes (23 bytes column overhead). So the total size of

a regular column is:

total_column_size = column_name_size + column_value_size + 15

Row Overhead - Just like columns, every row also incurs some overhead

when stored on disk. Every row in Cassandra incurs 23 bytes of overhead.

Primary Key Index - Every column family also maintains a primary index of

its row keys. Primary index overhead becomes more significant when you

have lots of skinny rows. Sizing of the primary row key index can be

estimated as follows (in bytes):

primary_key_index = number_of_rows * (32 + average_key_size)

Replication Overhead - The replication factor obviously plays a role in how

much disk capacity is used. For a replication factor of 1, there is no

overhead for replicas (as only one copy of your data is stored in the

cluster). If replication factor is greater than 1, then your total data storage

requirement will include replication overhead.

replication_overhead = total_data_size * (replication_factor - 1)
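As a hypothetical illustration of these formulas, consider a regular column with a 10-byte name and a 100-byte value, stored across 1,000,000 rows (10-byte row keys) with a replication factor of 3:

total_column_size = 10 + 100 + 15 = 125 bytes

primary_key_index = 1,000,000 * (32 + 10) = 42,000,000 bytes (about 40 MB)

replication_overhead = total_data_size * (3 - 1) = 2 * total_data_size

so the cluster as a whole stores roughly three times the single-copy data size.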


PART Three: Big Data QL (Hive QL vs CQL)

3.1 HIVE Query language

3.1.1 What is Hive

Hive is a data warehousing infrastructure built on top of Hadoop. Hadoop

provides massive scale out and fault tolerance capabilities for data storage and

processing (using the map-reduce programming paradigm) on commodity hardware.

Hive is designed to enable easy data summarization, ad-hoc querying and

analysis of large volumes of data. It provides a simple query language called Hive QL,

which is based on SQL and which enables users familiar with SQL to do ad-hoc

querying, summarization and data analysis easily. At the same time, Hive QL also

allows traditional map/reduce programmers to be able to plug in their custom mappers

and reducers to do more sophisticated analysis that may not be supported by the built-

in capabilities of the language.

3.1.2 What Hive is NOT

Hadoop is a batch processing system and Hadoop jobs tend to have high latency

and incur substantial overheads in job submission and scheduling. As a result - latency

for Hive queries is generally very high (minutes) even when data sets involved are very

small (say a few hundred megabytes). As a result it cannot be compared with systems

such as Oracle where analyses are conducted on a significantly smaller amount of data

but the analyses proceed much more iteratively with the response times between

iterations being less than a few minutes. Hive aims to provide acceptable (but not

optimal) latency for interactive data browsing, queries over small data sets or test

queries.

Hive is not designed for online transaction processing and does not offer real-

time queries and row level updates. It is best used for batch jobs over large sets of

immutable data (like web logs).

In the following sections we provide a tutorial on the capabilities of the system. We start

by describing the concepts of data types, tables and partitions (which are very similar to

what you would find in a traditional relational DBMS) and then illustrate the capabilities

of the QL language with the help of some examples.

3.1.3 ADVANTAGES OF USING APACHE HIVE

Fits the low level interface requirement of Hadoop perfectly.



Supports external tables which make it possible to process data without actually

storing in HDFS.

It has a rule based optimizer for optimizing logical plans.

Supports partitioning of data at the level of tables to improve performance.

Metastore or Metadata store is a big plus in the architecture which makes the

lookup easy.

3.1.4 DISADVANTAGES OF USING APACHE HIVE

No support for update and delete.

No support for singleton inserts. Data is required to be loaded from a file using

LOAD command.

No access control implementation.

Correlated sub queries are not supported.

3.1.5 APACHE HIVE QL Syntax

Language capabilities

Hive query language provides the basic SQL like operations. These operations work on

tables or partitions. These operations are:

Ability to filter rows from a table using a where clause.

Ability to select certain columns from the table using a select clause.

Ability to do equi-joins between two tables.

Ability to evaluate aggregations on multiple "group by" columns for the data

stored in a table.

Ability to store the results of a query into another table.

Ability to download the contents of a table to a local (e.g., nfs) directory.

Ability to store the results of a query in a hadoop dfs directory.

Ability to manage tables and partitions (create, drop and alter).

Ability to plug in custom scripts in the language of choice for custom map/reduce

jobs.
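As a quick illustration, several of these operations can be combined in a single statement (the table and column names below are hypothetical):

INSERT OVERWRITE TABLE us_page_counts
SELECT pv.page_url, count(*)
FROM page_views pv
WHERE pv.country = 'US'
GROUP BY pv.page_url;

This filters rows with a where clause, selects specific columns, aggregates them with a group by, and stores the result into another table.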

What's New in Hive QL?

Data Units:

· Partitions: Each Table can have one or more partition Keys which determines

how the data is stored. Partitions - apart from being storage units - also allow the user to

efficiently identify the rows that satisfy a certain criteria. For example, a date_partition of

type STRING and country_partition of type STRING. Each unique value of the partition


keys defines a partition of the Table. For example all "US" data from "2009-12-23" is a

partition of the page_views table. Therefore, if you run analysis on only the "US" data

for 2009-12-23, you can run that query only on the relevant partition of the table thereby

speeding up the analysis significantly. Note however, that just because a partition is

named 2009-12-23 does not mean that it contains all or only data from that date;

partitions are named after dates for convenience but it is the user's job to guarantee the

relationship between partition name and data content. Partition columns are virtual
columns; they are not part of the data itself but are derived on load.

· Buckets (or Clusters): Data in each partition may in turn be divided into Buckets

based on the value of a hash function of some column of the Table. For example the

page_views table may be bucketed by userid, which is one of the columns, other than

the partitions columns, of the page_view table. These can be used to efficiently sample

the data.

Type System:

Complex Types

Complex Types can be built up from primitive types and other composite types using:

Structs: the elements within the type can be accessed using the DOT (.) notation.

For example, for a column c of type STRUCT {a INT; b INT} the a field is

accessed by the expression c.a

Maps (key-value tuples): The elements are accessed using ['element name']

notation. For example in a map M comprising of a mapping from 'group' -> gid

the gid value can be accessed using M['group']

Arrays (indexable lists): The elements in the array have to be in the same type.

Elements can be accessed using the [n] notation where n is an index (zero-

based) into the array. For example for an array A having the elements ['a', 'b', 'c'],

A[1] returns 'b'.

Using the primitive types and the constructs for creating complex types, types with

arbitrary levels of nesting can be created. For example, a type User may comprise of

the following fields:

gender - which is a STRING.

active - which is a BOOLEAN.
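As a sketch of how such a nested type might be declared and accessed (the table and column names here are hypothetical):

CREATE TABLE user_profiles (userid BIGINT, user_info STRUCT<gender: STRING, active: BOOLEAN>);

SELECT up.user_info.gender
FROM user_profiles up
WHERE up.user_info.active = true;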

How to Write a Hive Query?

Creating Tables:

An example statement that would create the page_view table mentioned above would

be like:


In this example the columns of the table are specified with the corresponding

types. Comments can be attached both at the column level as well as at the table level.

Additionally the partitioned by clause defines the partitioning columns which are

different from the data columns and are actually not stored with the data. When

specified in this way, the data in the files is assumed to be delimited with ASCII 001(ctrl-

A) as the field delimiter and newline as the row delimiter.

The field delimiter can be parametrized if the data is not in the above format as

illustrated in the following example:

The row delimiter currently cannot be changed since it is determined by Hadoop rather than by Hive.

It is also a good idea to bucket the tables on certain columns so that efficient

sampling queries can be executed against the data set. If bucketing is absent, random

sampling can still be done on the table but it is not efficient as the query has to scan all

the data. The following example illustrates the case of the page_view table that is

bucketed on the userid column:

CREATE TABLE page_view(viewTime INT, userid BIGINT,

page_url STRING, referrer_url STRING,

ip STRING COMMENT 'IP Address of the User')

COMMENT 'This is the page view table'

PARTITIONED BY(dt STRING, country STRING)

ROW FORMAT DELIMITED

FIELDS TERMINATED BY '1'

STORED AS SEQUENCEFILE;

CREATE TABLE page_view(viewTime INT, userid BIGINT,

page_url STRING, referrer_url STRING,

ip STRING COMMENT 'IP Address of the User')

COMMENT 'This is the page view table'

PARTITIONED BY(dt STRING, country STRING)

STORED AS SEQUENCEFILE;


In the example above, the table is clustered by a hash function of userid into 32

buckets. Within each bucket the data is sorted in increasing order of viewTime. Such an

organization allows the user to do efficient sampling on the clustered column - in this

case userid. The sorting property allows internal operators to take advantage of the

better-known data structure while evaluating queries with greater efficiency.

CREATE TABLE page_view(viewTime INT, userid BIGINT,

page_url STRING, referrer_url STRING,

ip STRING COMMENT 'IP Address of the User')

COMMENT 'This is the page view table'

PARTITIONED BY(dt STRING, country STRING)

CLUSTERED BY(userid) SORTED BY(viewTime) INTO 32 BUCKETS

ROW FORMAT DELIMITED

FIELDS TERMINATED BY '1'

COLLECTION ITEMS TERMINATED BY '2'

MAP KEYS TERMINATED BY '3'

STORED AS SEQUENCEFILE;

CREATE TABLE page_view(viewTime INT, userid BIGINT,

page_url STRING, referrer_url STRING,

friends ARRAY<BIGINT>, properties MAP<STRING, STRING>

ip STRING COMMENT 'IP Address of the User')

COMMENT 'This is the page view table'

PARTITIONED BY(dt STRING, country STRING)

CLUSTERED BY(userid) SORTED BY(viewTime) INTO 32 BUCKETS

ROW FORMAT DELIMITED

FIELDS TERMINATED BY '1'

COLLECTION ITEMS TERMINATED BY '2'

MAP KEYS TERMINATED BY '3'

STORED AS SEQUENCEFILE;


In this example the columns that comprise of the table row are specified in a

similar way as the definition of types. Comments can be attached both at the column

level as well as at the table level. Additionally the partitioned by clause defines the

partitioning columns which are different from the data columns and are actually not

stored with the data. The CLUSTERED BY clause specifies which column to use for

bucketing as well as how many buckets to create. The delimited row format specifies

how the rows are stored in the hive table. In the case of the delimited format, this

specifies how the fields are terminated, how the items within collections (arrays or

maps) are terminated and how the map keys are terminated. STORED AS

SEQUENCEFILE indicates that this data is stored in a binary format (using hadoop

SequenceFiles) on hdfs. The values shown for the ROW FORMAT and STORED AS

clauses in the above example represent the system defaults.

Table names and column names are case insensitive.

Browsing Tables and Partitions

SHOW TABLES;

To list existing tables in the warehouse; there are many of these, likely more than

you want to browse.

To list tables with prefix 'page'. The pattern follows Java regular expression

syntax (so the period is a wildcard).

To list partitions of a table. If the table is not a partitioned table then an error is

thrown.

To list columns and column types of a table.

To list columns and all other properties of a table. This prints a lot of information, and
not in a pretty format. Usually used for debugging.

SHOW TABLES 'page.*';

SHOW PARTITIONS page_view;

DESCRIBE page_view;

DESCRIBE EXTENDED page_view;


To list columns and all other properties of a partition. This also prints a lot of
information and is usually used for debugging.

Loading Data

There are multiple ways to load data into Hive tables. The user can create an

external table that points to a specified location within HDFS. In this particular usage,

the user can copy a file into the specified location using the HDFS put or copy

commands and create a table pointing to this location with all the relevant row format

information. Once this is done, the user can transform the data and insert them into any

other Hive table. For example, if the file /tmp/pv_2008-06-08.txt contains comma

separated page views served on 2008-06-08, and this needs to be loaded into the

page_view table in the appropriate partition, the following sequence of commands can

achieve this:

DESCRIBE EXTENDED page_view PARTITION (ds='2008-08-08');

CREATE EXTERNAL TABLE page_view_stg(viewTime INT, userid BIGINT,

page_url STRING, referrer_url STRING,

ip STRING COMMENT 'IP Address of the User',

country STRING COMMENT 'country of origination')

COMMENT 'This is the staging page view table'

ROW FORMAT DELIMITED FIELDS TERMINATED BY '44' LINES TERMINATED

BY '12'

STORED AS TEXTFILE

LOCATION '/user/data/staging/page_view';

hadoop dfs -put /tmp/pv_2008-06-08.txt /user/data/staging/page_view

FROM page_view_stg pvs

INSERT OVERWRITE TABLE page_view PARTITION(dt='2008-06-08',

country='US')

SELECT pvs.viewTime, pvs.userid, pvs.page_url, pvs.referrer_url, null, null,

pvs.ip

WHERE pvs.country = 'US';


In the example above nulls are inserted for the array and map types in the

destination tables but potentially these can also come from the external table if the

proper row formats are specified.

This method is useful if there is already legacy data in HDFS on which the user

wants to put some metadata so that the data can be queried and manipulated using

Hive.

Additionally, the system also supports syntax that can load the data from a file in

the local files system directly into a Hive table where the input data format is the same

as the table format. If /tmp/pv_2008-06-08_us.txt already contains the data for US, then

we do not need any additional filtering as shown in the previous example. The load in

this case can be done using the following syntax:

The path argument can take a directory (in which case all the files in the directory

are loaded), a single file name, or a wildcard (in which case all the matching files are

uploaded). If the argument is a directory - it cannot contain subdirectories. Similarly - the

wildcard must match file names only.

In the case that the input file /tmp/pv_2008-06-08_us.txt is very large, the user

may decide to do a parallel load of the data (using tools that are external to Hive). Once

the file is in HDFS - the following syntax can be used to load the data into a Hive table:

It is assumed that the array and map fields in the input.txt files are null fields for these

examples.

Simple Query

LOAD DATA LOCAL INPATH '/tmp/pv_2008-06-08_us.txt' INTO
TABLE page_view PARTITION(date='2008-06-08', country='US');

LOAD DATA INPATH '/user/data/pv_2008-06-08_us.txt' INTO
TABLE page_view PARTITION(date='2008-06-08', country='US');

INSERT OVERWRITE TABLE user_active

SELECT user.*

FROM user

WHERE user.active = 1;


Note that unlike SQL, we always insert the results into a table. We will illustrate later

how the user can inspect these results and even dump them to a local file. You can also

run the following query on Hive CLI:

This will be internally rewritten to some temporary file and displayed to the Hive client

side.

Partition Based Query

What partitions to use in a query is determined automatically by the system on the

basis of where clause conditions on partition columns. For example, in order to get all

the page_views in the month of 03/2008 referred from domain xyz.com, one could write

the following query:

Note that page_views.date is used here because the table (above) was defined with

PARTITIONED BY(date DATETIME, country STRING) ; if you name your partition

something different, don't expect .date to do what you think!

Joins

In order to get a demographic breakdown (by gender) of page_view of 2008-03-03

one would need to join the page_view table and the user table on the userid column.

This can be accomplished with a join as shown in the following query:

INSERT OVERWRITE TABLE xyz_com_page_views

SELECT page_views.*

FROM page_views

WHERE page_views.date >= '2008-03-01' AND page_views.date

<= '2008-03-31' AND

page_views.referrer_url like '%xyz.com';

SELECT user.*

FROM user

WHERE user.active = 1;


In order to do outer joins the user can qualify the join with LEFT OUTER, RIGHT

OUTER or FULL OUTER keywords in order to indicate the kind of outer join (left

preserved, right preserved or both sides preserved). For example, in order to do a full

outer join in the query above, the corresponding syntax would look like the following

query:

In order to check the existence of a key in another table, the user can use LEFT SEMI

JOIN as illustrated by the following example.

INSERT OVERWRITE TABLE pv_users

SELECT pv.*, u.gender, u.age

FROM user u JOIN page_view pv ON (pv.userid = u.id)

WHERE pv.date = '2008-03-03';

INSERT OVERWRITE TABLE pv_users

SELECT pv.*, u.gender, u.age

FROM user u FULL OUTER JOIN page_view pv ON (pv.userid = u.id)

WHERE pv.date = '2008-03-03';

INSERT OVERWRITE TABLE pv_users

SELECT u.*

FROM user u LEFT SEMI JOIN page_view pv ON (pv.userid = u.id)

WHERE pv.date = '2008-03-03';

In order to join more than two tables, the user can use the following syntax:

INSERT OVERWRITE TABLE pv_friends

SELECT pv.*, u.gender, u.age, f.friends

FROM page_view pv JOIN user u ON (pv.userid = u.id) JOIN friend_list f ON (u.id =

f.uid)

WHERE pv.date = '2008-03-03';


Note that Hive only supports equi-joins. Also it is best to put the largest table on the

rightmost side of the join to get the best performance.

Aggregations

In order to count the number of distinct users by gender one could write the following

query:

Multiple aggregations can be done at the same time; however, no two aggregations can
have different DISTINCT columns. For example, while the following is possible:

INSERT OVERWRITE TABLE pv_gender_sum

SELECT pv_users.gender, count (DISTINCT pv_users.userid)

FROM pv_users

GROUP BY pv_users.gender;

INSERT OVERWRITE TABLE pv_gender_agg

SELECT pv_users.gender, count(DISTINCT pv_users.userid), count(*),

sum(DISTINCT pv_users.userid)

FROM pv_users

GROUP BY pv_users.gender;

however, the following query is not allowed

INSERT OVERWRITE TABLE pv_gender_agg

SELECT pv_users.gender, count(DISTINCT pv_users.userid), count(DISTINCT

pv_users.ip)

FROM pv_users

GROUP BY pv_users.gender;


Multi Table/File Inserts

The output of the aggregations or simple selects can be further sent into multiple

tables or even to hadoop dfs files (which can then be manipulated using hdfs utilities).

e.g. if along with the gender breakdown, one needed to find the breakdown of unique

page views by age, one could accomplish that with the following query:

The first insert clause sends the results of the first group by to a Hive table while the

second one sends the results to a hadoop dfs file.

Dynamic-partition Insert

In the previous examples, the user has to know which partition to insert into and only

one partition can be inserted in one insert statement. If you want to load into multiple

partitions, you have to use multi-insert statement as illustrated below.

FROM pv_users
INSERT OVERWRITE TABLE pv_gender_sum
SELECT pv_users.gender, count(DISTINCT pv_users.userid)
GROUP BY pv_users.gender
INSERT OVERWRITE DIRECTORY '/user/data/tmp/pv_age_sum'
SELECT pv_users.age, count(DISTINCT pv_users.userid)
GROUP BY pv_users.age;

FROM page_view_stg pvs

INSERT OVERWRITE TABLE page_view PARTITION(dt='2008-06-08',

country='US')

SELECT pvs.viewTime, pvs.userid, pvs.page_url, pvs.referrer_url,

null, null, pvs.ip WHERE pvs.country = 'US'

INSERT OVERWRITE TABLE page_view PARTITION(dt='2008-06-08',

country='CA')

SELECT pvs.viewTime, pvs.userid, pvs.page_url, pvs.referrer_url, null, null,

pvs.ip WHERE pvs.country = 'CA'

INSERT OVERWRITE TABLE page_view PARTITION(dt='2008-06-08',

country='UK')

SELECT pvs.viewTime, pvs.userid, pvs.page_url,

pvs.referrer_url, null, null, pvs.ip WHERE pvs.country = 'UK';


In order to load data into all country partitions for a particular day, you have to add an
insert statement for each country in the input data. This is very inconvenient, since you
must know in advance the list of countries in the input data and create the partitions
beforehand. If the list changes on another day, you have to modify your insert DML as
well as the partition creation DDLs. It is also inefficient, since each insert statement may
be turned into a MapReduce job.

Dynamic-partition insert (or multi-partition insert) is designed to solve this

problem by dynamically determining which partitions should be created and

populated while scanning the input table. This is a newly added feature that is

only available from version 0.6.0. In the dynamic partition insert, the input column

values are evaluated to determine which partition this row should be inserted

into. If that partition has not been created, it will create that partition

automatically. Using this feature you need only one insert statement to create

and populate all necessary partitions. In addition, since there is only one insert

statement, there is only one corresponding MapReduce job. This significantly

improves performance and reduces the Hadoop cluster workload compared to the
multiple-insert case.

Below is an example of loading data to all country partitions using one insert statement:

There are several syntactic differences from the multi-insert statement:

country appears in the PARTITION specification, but with no value

associated. In this case, country is a dynamic partition column. On the

other hand, ds has a value associated with it, which means it is a static

partition column. If a column is dynamic partition column, its value will be

coming from the input column. Currently we only allow dynamic partition

columns to be the last column(s) in the partition clause because the

partition column order indicates its hierarchical order (meaning dt is the

root partition, and country is the child partition). You cannot specify a

FROM page_view_stg pvs
INSERT OVERWRITE TABLE page_view PARTITION(dt='2008-06-08', country)
SELECT pvs.viewTime, pvs.userid, pvs.page_url,
pvs.referrer_url, null, null, pvs.ip, pvs.country;


partition clause with (dt, country='US') because that means you need to
update all partitions with any date whose country sub-partition is 'US'.

An additional pvs.country column is added in the select statement. This is

the corresponding input column for the dynamic partition column. Note

that you do not need to add an input column for the static partition column

because its value is already known in the PARTITION clause. Note that

the dynamic partition values are selected by ordering, not name, and

taken as the last columns from the select clause.

Semantics of the dynamic partition insert statement:

When non-empty partitions already exist for the dynamic
partition columns (e.g., country='CA' exists under some ds root partition),

it will be overwritten if the dynamic partition insert saw the same value

(say 'CA') in the input data. This is in line with the 'insert overwrite'

semantics. However, if the partition value 'CA' does not appear in the input

data, the existing partition will not be overwritten.

Since a Hive partition corresponds to a directory in HDFS, the partition

value has to conform to the HDFS path format (URI in Java). Any

character having a special meaning in URI (e.g., '%', ':', '/', '#') will be

escaped with '%' followed by 2 bytes of its ASCII value.

If the input column is a type different than STRING, its value will be first

converted to STRING to be used to construct the HDFS path.

If the input column value is NULL or empty string, the row will be put into a

special partition, whose name is controlled by the hive parameter

hive.exec.default.partition.name. The default value is

__HIVE_DEFAULT_PARTITION__. Basically this partition will contain all "bad"

rows whose value are not valid partition names. The caveat of this approach is

that the bad value will be lost and is replaced by

__HIVE_DEFAULT_PARTITION__ if you select them in Hive. JIRA HIVE-1309 is a

solution to let user specify "bad file" to retain the input partition column values as

well.

Dynamic partition insert could potentially be a resource hog in that it could generate a
large number of partitions in a short time. To guard against this, three parameters are
defined:

o hive.exec.max.dynamic.partitions.pernode (default value being 100) is the

maximum dynamic partitions that can be created by each mapper or

reducer. If one mapper or reducer creates more than the threshold, a

fatal error will be raised from the mapper/reducer (through counter) and

the whole job will be killed.


o hive.exec.max.dynamic.partitions (default value being 1000) is the total

number of dynamic partitions that can be created by one DML statement. If each

mapper/reducer did not exceed the limit but the total number of dynamic

partitions does, then an exception is raised at the end of the job before the

intermediate data are moved to the final destination.

o hive.exec.max.created.files (default value being 100000) is the maximum

total number of files created by all mappers and reducers. This is

implemented by updating a Hadoop counter by each mapper/reducer

whenever a new file is created. If the total number exceeds

hive.exec.max.created.files, a fatal error will be thrown and the job will be

killed.

o Another situation to protect against is the case where the user accidentally
specifies all partitions to be dynamic partitions
without specifying one static partition, while the original intention is to just

overwrite the sub-partitions of one root partition. We define another

parameter hive.exec.dynamic.partition.mode=strict to prevent the all-

dynamic partition case. In the strict mode, you have to specify at least one

static partition. The default mode is strict. In addition, we have a

parameter hive.exec.dynamic.partition=true/false to control whether to

allow dynamic partition at all. The default value is false.
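In practice these parameters are set from the Hive CLI before running the dynamic-partition insert, for example:

hive> set hive.exec.dynamic.partition=true;
hive> set hive.exec.dynamic.partition.mode=nonstrict;
hive> set hive.exec.max.dynamic.partitions.pernode=1000;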

Troubleshooting and best practices:

As stated above, when too many dynamic partitions are created by a particular
mapper/reducer, a fatal error is raised and the job is killed. The error
message looks something like:

hive> set hive.exec.dynamic.partition.mode=nonstrict;

hive> FROM page_view_stg pvs

INSERT OVERWRITE TABLE page_view PARTITION(dt,

country)

SELECT pvs.viewTime, pvs.userid, pvs.page_url,

pvs.referrer_url, null, null, pvs.ip,

from_unixtime(pvs.viewTime, 'yyyy-MM-dd') ds,

pvs.country;

...

2010-05-07 11:10:19,816 Stage-1 map = 0%, reduce = 0%

[Fatal Error] Operator FS_28 (id=41): fatal error. Killing the job.

Ended Job = job_201005052204_28178 with errors


The problem here is that one mapper will take a random set of rows, and it is very

likely that the number of distinct (dt, country) pairs will exceed the limit of

hive.exec.max.dynamic.partitions.pernode. One way around it is to group the rows by

the dynamic partition columns in the mapper and distribute them to the reducers where

the dynamic partitions will be created. In this case the number of distinct dynamic

partitions will be significantly reduced. The above example query could be rewritten to:

This query will generate a MapReduce job rather than a map-only job. The SELECT
clause will be converted into a plan for the mappers, and the output will be distributed to
the reducers based on the value of the (ds, country) pairs. The INSERT clause will be
converted into the plan for the reducers, which write to the dynamic partitions.

Inserting into local files

In certain situations you would want to write the output into a local file so that you could

load it into an excel spreadsheet. This can be accomplished with the following

command:


hive> set hive.exec.dynamic.partition.mode=nonstrict;

hive> FROM page_view_stg pvs

INSERT OVERWRITE TABLE page_view PARTITION(dt,

country)

SELECT pvs.viewTime, pvs.userid, pvs.page_url,

pvs.referrer_url, null, null, pvs.ip,

from_unixtime(pvs.viewTime, 'yyyy-MM-dd') ds,

pvs.country

DISTRIBUTE BY ds, country;

INSERT OVERWRITE LOCAL DIRECTORY '/tmp/pv_gender_sum'

SELECT pv_gender_sum.*

FROM pv_gender_sum;


Sampling

The sampling clause allows the users to write queries for samples of the data instead of
the whole table. Currently the sampling is done on the columns that are specified in the
CLUSTERED BY clause of the CREATE TABLE statement. In the following example we
choose the 3rd bucket out of the 32 buckets of the pv_gender_sum table:

INSERT OVERWRITE TABLE pv_gender_sum_sample
SELECT pv_gender_sum.*
FROM pv_gender_sum TABLESAMPLE(BUCKET 3 OUT OF 32);

In general the TABLESAMPLE syntax looks like:

TABLESAMPLE(BUCKET x OUT OF y)

y has to be a multiple or divisor of the number of buckets in that table as specified at
table creation time. The buckets chosen are those for which bucket_number modulo y is
equal to x. So in the above example the following tablesample clause

TABLESAMPLE(BUCKET 3 OUT OF 16)

would pick out the 3rd and 19th buckets. The buckets are numbered starting from 0.
On the other hand the tablesample clause

TABLESAMPLE(BUCKET 3 OUT OF 64 ON userid)

would pick out half of the 3rd bucket.

Union all

The language also supports union all, e.g. if we suppose there are two different tables
that track which user has published a video and which user has published a comment,
the following query joins the results of a union all with the user table to create a single
annotated stream for all the video publishing and comment publishing events:


Array Operations

Array columns in tables can only be created programmatically currently. We will be
extending this soon to make it available as part of the create table statement. For the
purpose of the current example, assume that pv.friends is of the type array<INT>, i.e. it is
an array of integers. The user can get a specific element in the array by its index as
shown in the following command:

SELECT pv.friends[2]
FROM page_views pv;

The select expression gets the third item in the pv.friends array.

The user can also get the length of the array using the size function as shown below:

SELECT pv.userid, size(pv.friends)

FROM page_view pv;

INSERT OVERWRITE TABLE actions_users

SELECT u.id, actions.date

FROM (

SELECT av.uid AS uid

FROM action_video av

WHERE av.date = '2008-06-03'

UNION ALL

SELECT ac.uid AS uid

FROM action_comment ac

WHERE ac.date = '2008-06-03'

) actions JOIN users u ON(u.id = actions.uid);


Map(Associative Arrays) Operations

Maps provide collections similar to associative arrays. Such structures can only be
created programmatically currently. We will be extending this soon. For the purpose of
the current example, assume that pv.properties is of the type map<String, String>, i.e. it
is an associative array from strings to strings. Accordingly, the following query can be
used to select the 'page_type' property from the page_views table:

Similar to arrays, the size function can also be used to get the number of elements in a

map as shown in the following query:

SELECT size(pv.properties)

FROM page_view pv;

Custom map/reduce scripts

Users can also plug in their own custom mappers and reducers in the data stream

by using features natively supported in the Hive language. e.g. in order to run a custom

mapper script - map_script - and a custom reducer script - reduce_script - the user can

issue the following command which uses the TRANSFORM clause to embed the

mapper and the reducer scripts.

Note that columns will be transformed to string and delimited by TAB before feeding

to the user script, and the standard output of the user script will be treated as TAB-

separated string columns. User scripts can output debug information to standard error

which will be shown on the task detail page on hadoop.

INSERT OVERWRITE TABLE page_views_map
SELECT pv.userid, pv.properties['page_type']
FROM page_views pv;


Schema-less map/reduce: If there is no "AS" clause after "USING map_script",

Hive assumes the output of the script contains 2 parts: key which is before the first tab,

and value which is the rest after the first tab. Note that this is different from specifying

"AS key, value" because in that case value will only contains the portion between the

first tab and the second tab if there are multiple tabs.

FROM (FROM pv_users

MAP pv_users.userid, pv_users.date

USING 'map_script'

AS dt, uid

CLUSTER BY dt) map_output

INSERT OVERWRITE TABLE pv_users_reduced

REDUCE map_output.dt, map_output.uid

USING 'reduce_script'

AS date, count;
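Note that before such a query can run, the mapper and reducer scripts generally have to be shipped to the cluster with the ADD FILE command; the paths below are placeholders:

hive> add FILE /path/to/map_script;
hive> add FILE /path/to/reduce_script;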

Sample map script (weekday_mapper.py )

import sys

import datetime

for line in sys.stdin:

line = line.strip()

userid, unixtime = line.split('\t')

weekday = datetime.datetime.fromtimestamp(float(unixtime)).isoweekday()

print ','.join([userid, str(weekday)])

Of course, both MAP and REDUCE are "syntactic sugar" for the more general

select transform. The inner query could also have been written as such:

SELECT TRANSFORM(pv_users.userid, pv_users.date) USING 'map_script' AS

dt, uid CLUSTER BY dt FROM pv_users;


In this way, we allow users to migrate old map/reduce scripts without knowing the

schema of the map output. The user still needs to know the reduce output schema because
it has to match what is in the table that we are inserting into.

Distribute By and Sort By: Instead of specifying "cluster by", the user can specify

"distribute by" and "sort by", so the partition columns and sort columns can be different.

The usual case is that the partition columns are a prefix of sort columns, but that is not

required.

FROM (

FROM pv_users

MAP pv_users.userid, pv_users.date

USING 'map_script'

CLUSTER BY key) map_output

INSERT OVERWRITE TABLE pv_users_reduced

REDUCE map_output.dt, map_output.uid

USING 'reduce_script'

AS date, count;

FROM ( FROM pv_users

MAP pv_users.userid, pv_users.date

USING 'map_script' AS c1, c2, c3

DISTRIBUTE BY c2 SORT BY c2, c1) map_output

INSERT OVERWRITE TABLE pv_users_reduced

REDUCE map_output.c1, map_output.c2, map_output.c3

USING 'reduce_script'

AS date, count;


Co-Groups

Amongst the user community using map/reduce, cogroup is a fairly common

operation wherein the data from multiple tables are sent to a custom reducer such that

the rows are grouped by the values of certain columns on the tables. With the UNION

ALL operator and the CLUSTER BY specification, this can be achieved in the Hive

query language in the following way. Suppose we wanted to cogroup the rows from the

actions_video and action_comments table on the uid column and send them to the

'reduce_script' custom reducer, the following syntax can be used by the user:

Altering Tables

To rename an existing table to a new name (if a table with the new name already exists,
an error is returned):

FROM ( FROM (

FROM action_video av

SELECT av.uid AS uid, av.id AS id, av.date AS date

UNION ALL

FROM action_comment ac

SELECT ac.uid AS uid, ac.id AS id, ac.date AS date

) union_actions

SELECT union_actions.uid, union_actions.id, union_actions.date

CLUSTER BY union_actions.uid) map

INSERT OVERWRITE TABLE actions_reduced

SELECT TRANSFORM(map.uid, map.id, map.date) USING

'reduce_script' AS (uid, id, reduced_val);


To rename the columns of an existing table. Be sure to use the same column types, and

to include an entry for each preexisting column:

Note that a change in the schema (such as adding a column) preserves the

schema for the old partitions of the table in case it is a partitioned table. All the queries

that access these columns and run over the old partitions implicitly return a null value or

the specified default values for these columns.

In later versions, the behavior of assuming certain values, as opposed to throwing an
error when the column is not found in a particular partition, can be made configurable.

Dropping Tables and Partitions

Dropping tables is fairly trivial. A drop on the table would implicitly drop any indexes (this

is a future feature) that would have been built on the table. The associated command is

ALTER TABLE old_table_name RENAME TO new_table_name;

ALTER TABLE old_table_name REPLACE COLUMNS (col1 TYPE, ...);

To add columns to an existing table:

ALTER TABLE tab1 ADD COLUMNS (c1 INT COMMENT 'a new int

column', c2 STRING DEFAULT 'def val');

DROP TABLE pv_users;

To drop a partition, alter the table as follows:

ALTER TABLE pv_users DROP PARTITION (ds='2008-08-08')


Simple Example Use Cases

MovieLens User Ratings

First, create a table with tab-delimited text file format:

Note that if you're using Hive 0.5.0 or earlier you will need to use COUNT(1) in place of

COUNT(*).

Apache Weblog Data

The format of the Apache weblog is customizable, while most webmasters use the default.

For default Apache weblog, we can create a table with the following command.

add jar ../build/contrib/hive_contrib.jar;
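The corresponding table definition from the Hive getting-started documentation uses the contributed RegexSerDe roughly as follows (the exact regular expression may differ between Hive versions):

CREATE TABLE apachelog (
host STRING, identity STRING, user STRING, time STRING,
request STRING, status STRING, size STRING,
referer STRING, agent STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
"input.regex" = "([^ ]*) ([^ ]*) ([^ ]*) (-|\\[[^\\]]*\\]) ([^ \"]*|\"[^\"]*\") (-|[0-9]*) (-|[0-9]*)(?: ([^ \"]*|\"[^\"]*\") ([^ \"]*|\"[^\"]*\"))?"
)
STORED AS TEXTFILE;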

Hive Client: (programming languages and interfaces supported by Hive)

Command Line

JDBC- Java database connectivity:

JDBC Client Sample Code

Running the JDBC Sample Code

JDBC Client Setup for a Secure Cluster

CREATE TABLE u_data (

userid INT,

movieid INT,

rating INT,

unixtime STRING)

ROW FORMAT DELIMITED

FIELDS TERMINATED BY '\t'

STORED AS TEXTFILE;

Extract the data files and load them into the table that was just created:

LOAD DATA LOCAL INPATH 'ml-data/u.data'

OVERWRITE INTO TABLE u_data;

Count the number of rows in table u_data:

SELECT COUNT(*) FROM u_data;

Note that for versions of Hive which don't include HIVE-287, you'll need to

use COUNT(1) in place of COUNT(*).

Now we can do some complex data analysis on the table u_data:

Create weekday_mapper.py:

import sys

import datetime

for line in sys.stdin:

line = line.strip()

userid, movieid, rating, unixtime = line.split('\t')

weekday =

datetime.datetime.fromtimestamp(float(unixtime)).isoweekday()

print '\t'.join([userid, movieid, rating, str(weekday)])
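The mapper can then be plugged into a TRANSFORM query, following the Hive getting-started example (the u_data_new table is created here only to hold the transformed rows):

CREATE TABLE u_data_new (
userid INT,
movieid INT,
rating INT,
weekday INT)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t';

add FILE weekday_mapper.py;

INSERT OVERWRITE TABLE u_data_new
SELECT
TRANSFORM (userid, movieid, rating, unixtime)
USING 'python weekday_mapper.py'
AS (userid, movieid, rating, weekday)
FROM u_data;

SELECT weekday, COUNT(*)
FROM u_data_new
GROUP BY weekday;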


Python

PHP

Thrift Java Client

ODBC –open database connectivity: (standard C programming language

middleware API for accessing database management systems)

Thrift C++ Client

Cassandra Query Language

Cassandra

Cassandra is a highly scalable, eventually consistent, distributed, structured key-value

store. Cassandra brings together the distributed systems technologies from Dynamo

and the data model from Google's BigTable.

Like Dynamo, Cassandra is eventually consistent.

Like BigTable, Cassandra provides a ColumnFamily-based data model

richer than typical key/value systems.

Cassandra is a distributed storage system for managing very large amounts of data

spread out across many commodity servers, while providing highly available service

with no single point of failure.

Like a big hash table of hash tables

Column Database (schemaless)

Highly scalable

Add nodes in minutes

Fault tolerant

Distributed

Tunable


Cassandra Query Language (CQL)

CQL is effectively a structured, SQL-like query language with a familiar syntax and a
user-friendly API, attempting to push as much work as possible to the server side.

CQL Advantages & Disadvantages

Advantages:

Readability
Ease of use
SQL like
Stable

Disadvantages:

No joins
No ACID transactions
No ad-hoc queries
No GROUP BY
No ORDER BY

CQL Syntax:

CREATE KEYSPACE:

Synopsis

Description

CREATE KEYSPACE creates a top-level namespace and sets the replica placement

strategy (and associated replication options) for the keyspace. Valid keyspace names

are strings of alpha-numeric characters and underscores, and must begin with a letter.

Properties such as replication strategy and count are specified during creation.

CREATE KEYSPACE <ks_name>

WITH strategy_class = <value>

[ AND strategy_options:<option> = <value> [...] ];


Examples

Define a new keyspace using the simple replication strategy:

Define a new keyspace using a network-aware replication strategy and snitch. This

example assumes you are using the PropertyFileSnitch and your data centers are

named DC1 and DC2 in the cassandra-topology.properties file:

CREATE INDEX

Synopsis

Description

CREATE INDEX creates a new, automatic secondary index on the given column family

for the named column. Optionally, specify a name for the index itself before

the ON keyword. Enclose a single column name in parentheses. The column must

already have a type specified when the family was created, or added afterward by

altering the column family

Examples

Define a static column family and then create a secondary index on two of its named

columns:

CREATE KEYSPACE MyKeyspace WITH strategy_class = 'SimpleStrategy'

AND strategy_options:replication_factor = 1;

CREATE KEYSPACE MyKeyspace WITH strategy_class = 'NetworkTopologyStrategy'

AND strategy_options:DC1 = 3 AND strategy_options:DC2 = 3;

CREATE INDEX [<index_name>]

ON <cf_name> (<column_name>);

CREATE TABLE users (

KEY uuid PRIMARY KEY,

firstname text,

lastname text, email text,

address text,

zip int,

state text);
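The two secondary indexes themselves would then be created with statements such as the following (the resulting index names match the ones used in the DROP INDEX examples later in this section):

CREATE INDEX user_state ON users (state);
CREATE INDEX ON users (zip);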


CREATE TABLE

Synopsis

Description

CREATE TABLE creates new column family namespaces under the

current keyspace. You can also use the

alias CREATE COLUMNFAMILY. Valid column family names are strings

of alphanumeric characters and underscores, which begin with a letter.

The only schema information that must be defined for a column family is

the primary key (or row key) and its associated data type. Other column

metadata, such as the size of the associated row and key caches, can

be defined.

Specifying the Key Type

When creating a new column family, specify the key type. The list of possible types is

identical to column comparators/validators (see CQL Data Types).

Specifying Column Types (optional)

You can assign columns a type during column family creation. These

columns are validated when a write occurs, and intelligent CQL drivers

and interfaces can decode the column values correctly when receiving

them.

Column Family Options (not required)

A number of optional keyword arguments can be supplied to control the

configuration of a new column family. See CQL Column Family Storage

Parameters for the column family options you can define.

CREATE TABLE <cf_name> (

<key_column_name> <data_type> PRIMARY KEY

[, <column_name> <data_type> [, ...] ] )

[ WITH <storage_parameter> = <value>

[AND <storage_parameter> = <value> [...] ] ];


Examples

Dynamic column family definition:

ALTER TABLE

Manipulates the column metadata of a column family.

Synopsis

Description

ALTER TABLE manipulates the column family metadata. You can

change the data storage type of columns, add new columns, drop

existing columns, and change column family properties. No results are

returned.

You can also use the alias ALTER COLUMNFAMILY.

Changing the Type of a Typed Column

Adding a Typed Column.

Dropping a Typed Column

Modifying Column Family Options

ALTER TABLE <name>

(ALTER <column_name> TYPE <data_type>

| ADD <column_name> <data_type>

| DROP <column_name>

| WITH <optionname> = <val> [AND <optionname> = <val> [...]]);

CREATE TABLE user_events (user text PRIMARY KEY)

WITH comparator=timestamp AND default_validation=int;

Static column family definition:

CREATE TABLE users (

KEY uuid PRIMARY KEY,

username text, email text )

WITH comment='user information'

AND read_repair_chance = 1.0;


Examples

DELETE

Removes one or more columns from the named row(s).

Synopsis

Description

A DELETE statement removes one or more columns from one or more rows in the

named column family.

Specifying Columns

After the DELETE keyword, optionally list column names, separated by commas.

When no column names are specified, the entire row(s) specified in the WHERE clause

are deleted.

Specifying the Column Family

The column family name follows the list of column names and the keyword FROM.

ALTER TABLE users ALTER email TYPE varchar;

ALTER TABLE users ADD gender varchar;

ALTER TABLE users DROP gender;

ALTER TABLE users WITH comment = 'active users' AND read_repair_chance = 0.2;

DELETE [<column_name> [, ...]]

FROM <column_family>

[USING CONSISTENCY <consistency_level> [AND TIMESTAMP <integer>]]

WHERE <row_specification>;

<row_specification> is:

KEY | <key_alias> = <key_value>

KEY | <key_alias> IN (<key_value> [,...])


Specifying Options

You can specify these options:

Consistency level

Timestamp for the written columns.

When a column is deleted, it is not removed from disk immediately. The deleted column

is marked with a tombstone and then removed after the configured grace period has

expired. The optional timestamp defines the new tombstone record. See About

Deletes for more information about how Cassandra handles deleted columns and rows.

Specifying Rows

The WHERE clause specifies which row or rows to delete from the column family.

Example

INSERT

Adds or updates one or more columns in the identified row of a column family.

Synopsis

DELETE email, phone

FROM users

USING CONSISTENCY QUORUM AND TIMESTAMP 1318452291034

WHERE user_name = 'jsmith';

DELETE phone FROM users WHERE KEY IN ('jdoe', 'jsmith');

INSERT INTO <column_family> (<key_name>, <column_name> [, ...])

VALUES (<key_value>, <column_value> [, ...])

[USING <write_option> [AND <write_option> [...] ] ];

<write_option> is:

CONSISTENCY <consistency_level>

TTL <seconds>

TIMESTAMP <integer>


Description

An INSERT writes one or more columns to a record in a Cassandra column family. No
results are returned. The first column name in the INSERT list must be the name of the
column family key (either the KEY keyword or the row key alias defined on the column
family).

Specifying Options

You can specify these options:

Consistency level

Time-to-live (TTL)

Timestamp for the written columns.

TTL columns are automatically marked as deleted (with a tombstone) after the

requested amount of time has expired.

Example

SELECT

Retrieves data from a Cassandra column family.

Synopsis

INSERT INTO NerdMovies (KEY, 11924)

VALUES ('cfd66ccc-d857-4e90-b1e5-df98a3d40cd6', 'johndoe')

USING CONSISTENCY LOCAL_QUORUM AND TTL 86400;

SELECT [FIRST <n>] [REVERSED] <select expression>

FROM <column family>

[USING <consistency>]

[WHERE (<clause>)] [LIMIT <n>]

<select expression> syntax is:

{ <start_of_range> .. <end_of_range> | * }

| COUNT(* | 1)

KEY | <key_alias> { = | < | > | <= | >= } <key_value>

KEY | <key_alias> IN (<key_value> [,...])


Description

A SELECT expression reads one or more records from a Cassandra column family and

returns a result-set of rows. Each row consists of a row key and a collection of columns

corresponding to the query.

Unlike the projection in a SQL SELECT, there is no guarantee that the results will

contain all of the columns specified because Cassandra is schema-optional. An error

does not occur if you request non-existent columns.

Examples

Specifying Columns

The SELECT expression determines which columns, if any, appear in the result:

SELECT * from People;

Select two columns, Name and Occupation, from three rows having keys 199, 200, or

207:

SELECT Name, Occupation FROM People WHERE key IN (199, 200, 207);

Specifying a Range of Columns

To specify a range of columns, specify the start and end column names separated by

two periods (..). Select a range of columns from all rows, but limit the number of

columns to 3 per row starting with the end of the range:

SELECT FIRST 3 REVERSED 'time199'..'time100' FROM Events;


Counting Returned Rows

A SELECT expression using COUNT(*) returns the number of rows that matched the

query. Alternatively, you can use COUNT(1) to get the same result.

Count the number of rows in the users column family:

SELECT COUNT(*) FROM users;


If you do not specify a limit, a maximum of 10,000 rows are returned by default. Using

the LIMIT option, you can specify that the query return a greater or fewer number of

rows.

Specifying the Column Family, FROM, Clause

Count the number of rows in the Migrations column family in the system keyspace:

SELECT COUNT(*) FROM system.Migrations;

Specifying a Consistency Level

You can optionally specify a consistency level, such as QUORUM:

SELECT * from People USING CONSISTENCY QUORUM;

See tunable consistency for more information about the consistency levels.

Filtering Data Using the WHERE Clause

Relational operators are: =, >, >=, <, or <=.

To filter an indexed column, the term on the left of the operator must be the name of the

column, and the term on the right must be the value to filter on.
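For example, using the users column family defined earlier (where the state column has a secondary index), a filtered query might look like:

SELECT firstname, lastname FROM users WHERE state = 'TX';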

Specifying Rows Returned Using LIMIT

By default, a query returns 10,000 rows maximum. Using the LIMIT clause, you can
change this default limit to a lesser or greater number of rows.

SELECT * from Artists WHERE favoriteArtist = 'Adele' LIMIT 90000;

SELECT COUNT(*) FROM big_columnfamily;

count

-------

10000

SELECT COUNT(*) FROM big_columnfamily LIMIT 50000;

count

-------

50000

SELECT COUNT(*) FROM big_columnfamily LIMIT 200000;

count

--------

105291


UPDATE

Updates one or more columns in the identified row of a column family.

Synopsis

UPDATE <column_family>
[ USING <write_option> [ AND <write_option> [...] ] ]
SET <column_name> = <column_value> [, ...]
  | <counter_column_name> = <counter_column_name> {+ | -} <integer>
WHERE <row_specification>;

<write_option> is:

CONSISTENCY <consistency_level>
TTL <seconds>
TIMESTAMP <integer>

<row_specification> is:

KEY | <key_alias> = <key_value>
KEY | <key_alias> IN (<key_value> [,...])

Description

An UPDATE writes one or more columns to a record in a Cassandra column family. You can specify these options:

Consistency level
Time-to-live (TTL)
Timestamp for the written columns.

TTL columns are automatically marked as deleted (with a tombstone) after the requested amount of time has expired.


Examples

Update a column in several rows at once:

UPDATE users USING CONSISTENCY QUORUM
SET 'state' = 'TX'
WHERE KEY IN (88b8fd18-b1ed-4e96-bf79-4280797cba80,
06a8913c-c0d6-477c-937d-6c1b69a95d43,
bc108776-7cb5-477f-917d-869c12dfffa8);

Update several columns in a single row:

UPDATE users USING CONSISTENCY QUORUM
SET 'name' = 'John Smith', 'email' = '[email protected]'
WHERE user_uuid = 88b8fd18-b1ed-4e96-bf79-4280797cba80;

Update the value of a counter column:

UPDATE page_views USING CONSISTENCY QUORUM AND TIMESTAMP=1318452291034
SET 'index.html' = 'index.html' + 1
WHERE KEY = 'www.datastax.com';

TRUNCATE

Removes all data from a column family.

Synopsis

TRUNCATE <column_family>;

Description


A TRUNCATE statement results in the immediate, irreversible removal of all data in the

named column family.

Example

TRUNCATE user_activity;

USE

Connects the current client session to a keyspace.

Synopsis

USE <keyspace_name>;

Description

A USE statement tells the current client session and the connected Cassandra instance which keyspace you will be working in.

Example

USE PortfolioDemo;

DROP INDEX

Drops the named secondary index.

Synopsis

DROP INDEX <name>;

Description

A DROP INDEX statement removes an existing secondary index. If the index was not given a name during creation, the index name is <columnfamily_name>_<column_name>_idx.


Example

DROP INDEX user_state;
DROP INDEX users_zip_idx;

DROP TABLE

Removes the named column family.

Synopsis

DROP TABLE <name>;

Description

A DROP TABLE statement results in the immediate, irreversible removal of a column family, including all data contained in the column family. You can also use the alias DROP COLUMNFAMILY.

Example

DROP TABLE worldSeriesAttendees;

CQL (Platforms) Support:

PHP
Python
Java
Ruby

CQL (IDEs) Support:

Eclipse
Netbeans
Python


PART Four: Work Environment

4.1 Clustering

4.1.1 Concept of Clustering

A computer cluster consists of a set of loosely connected or tightly connected machines that work together so that in many respects they can be viewed as a single system.

The components of a cluster are usually connected to each other through fast local area network ("LAN"), with each node (computer used as a server) running its own instance of an operating system. Computer clusters emerged as a result of convergence of a number of computing trends including the availability of low cost microprocessors, high speed networks, and software for high performance distributed computing.

Clusters are usually deployed to improve performance and availability over that of a single machine, while typically being much more cost-effective than a single machine of comparable speed or availability.

The desire to get more computing power and better reliability by orchestrating a number of low cost commercial off-the-shelf computers has given rise to a variety of architectures and configurations.

The computer clustering approach usually (but not always) connects a number of readily available computing nodes (e.g. personal computers used as servers) via a fast local area network. The activities of the computing nodes are orchestrated by "clustering middleware", a software layer that sits atop the nodes and allows the users to treat the cluster as by and large one cohesive computing unit.

Computer clustering relies on a centralized management approach which makes the nodes available as orchestrated shared servers. It is distinct from other approaches such as peer-to-peer or grid computing, which also use many nodes but have a far more distributed nature.

A computer cluster may be a simple two-node system which just connects two personal computers, or may be a very fast supercomputer. A basic approach to building a cluster is that of a Beowulf cluster which may be built with a few personal computers to produce a cost-effective alternative to traditional high performance computing.

Although a cluster may consist of just a few personal computers connected by a simple network, the cluster architecture may also be used to achieve very high levels of performance.



Figure 4.1 Computer clustering

4.1.2 Clustering characteristics:

Consists of many of the same or similar types of machines
Tightly coupled, using dedicated network connections
All machines share resources such as a common home directory
They must trust each other so that RSH or SSH does not require a password; otherwise you would need to do a manual start on each machine.

4.1.3 Clustering attributes:

Computer clusters may be configured for different purposes, ranging from general-purpose business needs such as web-service support to computation-intensive scientific calculations. In either case, the cluster may use a high-availability approach.


Note that the attributes described below are not exclusive and a "compute cluster" may also use a high-availability approach, etc.

Load balancing clusters are configurations in which cluster-nodes share computational workload to provide better overall performance. For example, a web server cluster may assign different queries to different nodes, so the overall response time will be optimized. However, approaches to load-balancing may significantly differ among applications, e.g. a high-performance cluster used for scientific computations would balance load with different algorithms from a web-server cluster which may just use a simple round-robin method by assigning each new request to a different node.

Compute clusters are used for computation-intensive purposes, rather than for handling IO-oriented operations such as web serving or databases. For instance, a computer cluster might support computational simulation of vehicle crashes or weather. Very tightly coupled computer clusters are designed for work that may approach supercomputing.

4.1.4 Clustering Benefits:

Low Cost: Customers can eliminate the cost and complexity of procuring, configuring and operating HPC clusters with low, pay-as-you-go pricing. Further, you can optimize costs by leveraging one of several pricing models: On Demand, Reserved or Spot Instances.

Elasticity: You can add and remove compute resources to meet the size and time requirements of your workloads.

Run Jobs Anytime, Anywhere: You can launch compute jobs using simple APIs or management tools and automate workflows for maximum efficiency and scalability. You can increase your speed of innovation by accessing compute resources in minutes instead of spending time in queues.

4.1.5 Clustering management:

One of the challenges in the use of a computer cluster is the cost of administrating it which can at times be as high as the cost of administrating N independent machines, if the cluster has N nodes. In some cases this provides an advantage to shared memory architectures with lower administration costs. This has also made virtual machines popular, due to the ease of administration.

Task scheduling

When a large multi-user cluster needs to access very large amounts of data, task scheduling becomes a challenge. In a heterogeneous CPU-GPU cluster, which has a complex application environment, the performance of each job depends on the characteristics of the underlying cluster, so mapping tasks onto CPU cores and GPU devices presents significant challenges. This is an area of ongoing research; algorithms that combine and extend MapReduce and Hadoop have been proposed and studied.

Node failure management

When a node in a cluster fails, strategies such as "fencing" may be employed to keep the rest of the system operational. Fencing is the process of isolating a node or protecting shared resources when a node appears to be malfunctioning. There are two classes of fencing methods; one disables a node itself, and the other disallows access to resources such as shared disks.

4.1.6 Problems solved using Clustering:

Clusters can be used to solve three typical problems in a data center environment:

Need for High Availability. High availability refers to the ability to provide end user access to a service for a high percentage of scheduled time while attempting to reduce unscheduled outages. A solution is highly available if it meets the organization's scheduled uptime goals. Availability goals are achieved by reducing unplanned downtime and then working to improve total hours of service operation.

Need for High Reliability. High reliability refers to the ability to reduce the frequency of system failure, while attempting to provide fault tolerance in case of failure. A solution is highly reliable if it minimizes the number of single points of failure and reduces the risk that failure of a single component/system will result in the outage of the entire service offering. Reliability goals are achieved using redundant, fault tolerant hardware components, application software and systems.

Need for High Scalability. High scalability refers to the ability to add resources and computers while attempting to improve performance. A solution is highly scalable if it can be scaled up and out. Individual systems in a service offering can be scaled up by adding more resources (for example, CPUs, memory, disks, etc.). The service can be scaled out by adding additional computers.


4.1.7 Setting up single node cluster using Hadoop

Prerequisites

Sun Java 6

Hadoop requires a working Java 1.5+ installation. However, using Java 1.6 is recommended for running Hadoop.

# Add the Ferramosca Roberto's repository to your apt repositories
# See https://launchpad.net/~ferramroberto/
#
$ sudo apt-get install python-software-properties
$ sudo add-apt-repository ppa:ferramroberto/java

# Update the source list
$ sudo apt-get update

# Install Sun Java 6 JDK
$ sudo apt-get install sun-java6-jdk

# Select Sun's Java as the default on your machine.
# See 'sudo update-alternatives --config java' for more information.
#
$ sudo update-java-alternatives -s java-6-sun

The full JDK will be placed in /usr/lib/jvm/java-6-sun.

After installation, make a quick check whether Sun’s JDK is correctly set up:

user@ubuntu:~# java -version
java version "1.6.0_20"
Java(TM) SE Runtime Environment (build 1.6.0_20-b02)
Java HotSpot(TM) Client VM (build 16.3-b01, mixed mode, sharing)

Adding a dedicated Hadoop system user

We will use a dedicated Hadoop user account for running Hadoop. While that's not required, it is recommended because it helps to separate the Hadoop installation from other software applications and user accounts running on the same machine (think: security, permissions, backups, etc.).

$ sudo addgroup hadoop
$ sudo adduser --ingroup hadoop hduser

This will add the user hduser and the group hadoop to your local machine.

Configuring SSH

Hadoop requires SSH access to manage its nodes, i.e. remote machines plus your local machine if you want to use Hadoop on it (which is what we want to do in this short tutorial). For our single-node setup of Hadoop, we therefore need to configure SSH access to localhost for the hduser user we created in the previous section.

Secure Shell (SSH) is a cryptographic network protocol for secure data communication, remote command-line login, remote command execution, and other secure network services between two networked computers.

I assume that you have SSH up and running on your machine and configured it to allow SSH public key authentication.

First, we have to generate an SSH key for the hduser user.

user@ubuntu:~$ su - hduser
hduser@ubuntu:~$ ssh-keygen -t rsa -P ""
Generating public/private rsa key pair.
Enter file in which to save the key (/home/hduser/.ssh/id_rsa):
Created directory '/home/hduser/.ssh'.
Your identification has been saved in /home/hduser/.ssh/id_rsa.
Your public key has been saved in /home/hduser/.ssh/id_rsa.pub.
The key fingerprint is:
9b:82:ea:58:b4:e0:35:d7:ff:19:66:a6:ef:ae:0e:d2 hduser@ubuntu
The key's randomart image is:
[...snipp...]
hduser@ubuntu:~$

The second line will create an RSA key pair with an empty password. Generally, using an empty password is not recommended, but in this case it is needed to unlock the key without your interaction (you don’t want to enter the passphrase every time Hadoop interacts with its nodes).

Second, you have to enable SSH access to your local machine with this newly created key.


hduser@ubuntu:~$ cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys

The final step is to test the SSH setup by connecting to your local machine with the hduser user. The step is also needed to save your local machine's host key fingerprint to the hduser user's known_hosts file. If you have any special SSH configuration for your local machine, like a non-standard SSH port, you can define host-specific SSH options in $HOME/.ssh/config.

hduser@ubuntu:~$ ssh localhost
The authenticity of host 'localhost (::1)' can't be established.
RSA key fingerprint is d7:87:25:47:ae:02:00:eb:1d:75:4f:bb:44:f9:36:26.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'localhost' (RSA) to the list of known hosts.
Linux ubuntu 2.6.32-22-generic #33-Ubuntu SMP Wed Apr 28 13:27:30 UTC 2010 i686 GNU/Linux
Ubuntu 10.04 LTS
[...snipp...]
hduser@ubuntu:~$

Disabling IPv6

One problem with IPv6 on Ubuntu is that using 0.0.0.0 for the various networking-related Hadoop configuration options will result in Hadoop binding to the IPv6 addresses of my Ubuntu box. In my case, I realized that there’s no practical point in enabling IPv6 on a box when you are not connected to any IPv6 network. Hence, I simply disabled IPv6 on my Ubuntu machine. Your mileage may vary.

To disable IPv6 on Ubuntu 10.04 LTS, open /etc/sysctl.conf in the editor of your choice and add the following lines to the end of the file:

# disable ipv6
net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
net.ipv6.conf.lo.disable_ipv6 = 1

You have to reboot your machine in order to make the changes take effect.

You can check whether IPv6 is enabled on your machine with the following command:

$ cat /proc/sys/net/ipv6/conf/all/disable_ipv6


A return value of 0 means IPv6 is enabled, a value of 1 means disabled.

Hadoop installation:

After downloading Hadoop from the Apache Download Mirrors, extract the contents of the Hadoop package to a location of your choice. For example /usr/local/hadoop. Make sure to change the owner of all the files to the hduser user and hadoop group, for example:

$ cd /usr/local
$ sudo tar xzf hadoop-1.0.3.tar.gz
$ sudo mv hadoop-1.0.3 hadoop
$ sudo chown -R hduser:hadoop hadoop

Update $HOME/.bashrc

Add the following lines to the end of the $HOME/.bashrc file of user hduser. If you use a shell other than bash, you should of course update its appropriate configuration files instead of .bashrc.

# Set Hadoop-related environment variables
export HADOOP_HOME=/usr/local/hadoop

# Set JAVA_HOME (we also configure JAVA_HOME directly for Hadoop below)
export JAVA_HOME=/usr/lib/jvm/java-6-sun

# Some convenient aliases for running Hadoop commands
unalias fs &> /dev/null
alias fs="hadoop fs"
unalias hls &> /dev/null
alias hls="fs -ls"

# Preview the first 1000 lines of an LZO-compressed file stored in HDFS
lzohead () {
    hadoop fs -cat $1 | lzop -dc | head -1000 | less
}

# Add the Hadoop bin/ directory to PATH
export PATH=$PATH:$HADOOP_HOME/bin

Configurations:

Hadoop-env.sh

The only required environment variable we have to configure for Hadoop is JAVA_HOME. Open conf/hadoop-env.sh in the editor of your choice (if you used the installation path in this tutorial, the full path is /usr/local/hadoop/conf/hadoop-env.sh) and set the JAVA_HOME environment variable to the Sun JDK/JRE 6 directory.

Change


Conf/hadoop-env.sh

# The java implementation to use. Required.
# export JAVA_HOME=/usr/lib/j2sdk1.5-sun

To

Conf/hadoop-env.sh

# The java implementation to use. Required.
export JAVA_HOME=/usr/lib/jvm/java-6-sun

Conf/*-site.xml

We will configure the directory where Hadoop stores its data files and the network ports it listens on. Our setup will use Hadoop's Distributed File System (HDFS).

Now we create the directory and set the required ownerships and permissions:

$ sudo mkdir -p /app/hadoop/tmp
$ sudo chown hduser:hadoop /app/hadoop/tmp
# ...and if you want to tighten up security, chmod from 755 to 750...
$ sudo chmod 750 /app/hadoop/tmp

Add the following snippets between the <configuration> ... </configuration> tags in the respective configuration XML file.

In file conf/core-site.xml:

Conf/core-site.xml

<property>
  <name>hadoop.tmp.dir</name>
  <value>/app/hadoop/tmp</value>
  <description>A base for other temporary directories.</description>
</property>

<property>
  <name>fs.default.name</name>
  <value>hdfs://localhost:54310</value>
  <description>The name of the default file system. A URI whose
  scheme and authority determine the FileSystem implementation. The
  uri's scheme determines the config property (fs.SCHEME.impl) naming
  the FileSystem implementation class. The uri's authority is used to
  determine the host, port, etc. for a filesystem.</description>
</property>

In file conf/mapred-site.xml:

Conf/mapred-site.xml

<property>
  <name>mapred.job.tracker</name>
  <value>localhost:54311</value>
  <description>The host and port that the MapReduce job tracker runs
  at. If "local", then jobs are run in-process as a single map
  and reduce task.
  </description>
</property>

In file conf/hdfs-site.xml:

Conf/hdfs-site.xml

<property>
  <name>dfs.replication</name>
  <value>1</value>
  <description>Default block replication.
  The actual number of replications can be specified when the file is created.
  The default is used if replication is not specified in create time.
  </description>
</property>

Formatting the HDFS filesystem via the NameNode

The first step to starting up your Hadoop installation is formatting the Hadoop filesystem which is implemented on top of the local filesystem of your “cluster” (which includes only your local machine if you followed this tutorial). You need to do this the first time you set up a Hadoop cluster.

Do not format a running Hadoop filesystem as you will lose all the data currently in the

cluster (in HDFS)!


To format the filesystem (which simply initializes the directory specified by the dfs.name.dir variable), run the command

hduser@ubuntu:~$ /usr/local/hadoop/bin/hadoop namenode -format

The output will look like this:

hduser@ubuntu:/usr/local/hadoop$ bin/hadoop namenode -format
10/05/08 16:59:56 INFO namenode.NameNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG: host = ubuntu/127.0.1.1
STARTUP_MSG: args = [-format]
STARTUP_MSG: version = 0.20.2
STARTUP_MSG: build = https://svn.apache.org/repos/asf/hadoop/common/branches/branch-0.20 -r 911707; compiled by 'chrisdo' on Fri Feb 19 08:07:34 UTC 2010
************************************************************/
10/05/08 16:59:56 INFO namenode.FSNamesystem: fsOwner=hduser,hadoop
10/05/08 16:59:56 INFO namenode.FSNamesystem: supergroup=supergroup
10/05/08 16:59:56 INFO namenode.FSNamesystem: isPermissionEnabled=true
10/05/08 16:59:56 INFO common.Storage: Image file of size 96 saved in 0 seconds.
10/05/08 16:59:57 INFO common.Storage: Storage directory .../hadoop-hduser/dfs/name has been successfully formatted.
10/05/08 16:59:57 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at ubuntu/127.0.1.1
************************************************************/
hduser@ubuntu:/usr/local/hadoop$

Starting your single-node cluster

Run the command:

hduser@ubuntu:~$ /usr/local/hadoop/bin/start-all.sh

This will startup a Namenode, Datanode, Jobtracker and a Tasktracker on your machine.

The output will look like this:

hduser@ubuntu:/usr/local/hadoop$ bin/start-all.sh
starting namenode, logging to /usr/local/hadoop/bin/../logs/hadoop-hduser-namenode-ubuntu.out
localhost: starting datanode, logging to /usr/local/hadoop/bin/../logs/hadoop-hduser-datanode-ubuntu.out
localhost: starting secondarynamenode, logging to /usr/local/hadoop/bin/../logs/hadoop-hduser-secondarynamenode-ubuntu.out
starting jobtracker, logging to /usr/local/hadoop/bin/../logs/hadoop-hduser-jobtracker-ubuntu.out
localhost: starting tasktracker, logging to /usr/local/hadoop/bin/../logs/hadoop-hduser-tasktracker-ubuntu.out
hduser@ubuntu:/usr/local/hadoop$

A simple way to check whether the expected Hadoop processes are running is the jps tool. The output should look like this:

hduser@ubuntu:/usr/local/hadoop$ jps
2287 TaskTracker
2149 JobTracker
1938 DataNode
2085 SecondaryNameNode
2349 Jps
1788 NameNode

You can also check with netstat if Hadoop is listening on the configured ports.

hduser@ubuntu:~$ sudo netstat -plten | grep java
tcp 0 0 0.0.0.0:50070 0.0.0.0:* LISTEN 1001 9236 2471/java
tcp 0 0 0.0.0.0:50010 0.0.0.0:* LISTEN 1001 9998 2628/java
tcp 0 0 0.0.0.0:48159 0.0.0.0:* LISTEN 1001 8496 2628/java
tcp 0 0 0.0.0.0:53121 0.0.0.0:* LISTEN 1001 9228 2857/java
tcp 0 0 127.0.0.1:54310 0.0.0.0:* LISTEN 1001 8143 2471/java
tcp 0 0 127.0.0.1:54311 0.0.0.0:* LISTEN 1001 9230 2857/java
tcp 0 0 0.0.0.0:59305 0.0.0.0:* LISTEN 1001 8141 2471/java
tcp 0 0 0.0.0.0:50060 0.0.0.0:* LISTEN 1001 9857 3005/java
tcp 0 0 0.0.0.0:49900 0.0.0.0:* LISTEN 1001 9037 2785/java
tcp 0 0 0.0.0.0:50030 0.0.0.0:* LISTEN 1001 9773 2857/java
hduser@ubuntu:~$

Stopping your single-node cluster

Run the command

hduser@ubuntu:~$ /usr/local/hadoop/bin/stop-all.sh

to stop all the daemons running on your machine.

Example output:


hduser@ubuntu:/usr/local/hadoop$ bin/stop-all.sh
stopping jobtracker
localhost: stopping tasktracker
stopping namenode
localhost: stopping datanode
localhost: stopping secondarynamenode
hduser@ubuntu:/usr/local/hadoop$

Running a MapReduce job

We will now run our first Hadoop MapReduce job. We will use the WordCount example job, which reads text files and counts how often words occur. The input is text files and the output is text files, each line of which contains a word and the count of how often it occurred.

Download example input data

We will use three text files as example input:

hduser@ubuntu:~$ ls -l /tmp/gutenberg/
total 3604
-rw-r--r-- 1 hduser hadoop 674566 Feb 3 10:17 pg20417.txt
-rw-r--r-- 1 hduser hadoop 1573112 Feb 3 10:18 pg4300.txt
-rw-r--r-- 1 hduser hadoop 1423801 Feb 3 10:18 pg5000.txt
hduser@ubuntu:~$

Restart the Hadoop cluster

Restart your Hadoop cluster if it’s not running already.

hduser@ubuntu:~$ /usr/local/hadoop/bin/start-all.sh

Copy local example data to HDFS

Before we run the actual MapReduce job, we first have to copy the files from our local file system to Hadoop’s HDFS.

hduser@ubuntu:/usr/local/hadoop$ bin/hadoop dfs -copyFromLocal /tmp/gutenberg /user/hduser/gutenberg
hduser@ubuntu:/usr/local/hadoop$ bin/hadoop dfs -ls /user/hduser
Found 1 items
drwxr-xr-x - hduser supergroup 0 2010-05-08 17:40 /user/hduser/gutenberg
hduser@ubuntu:/usr/local/hadoop$ bin/hadoop dfs -ls /user/hduser/gutenberg
Found 3 items
-rw-r--r-- 3 hduser supergroup 674566 2011-03-10 11:38 /user/hduser/gutenberg/pg20417.txt
-rw-r--r-- 3 hduser supergroup 1573112 2011-03-10 11:38 /user/hduser/gutenberg/pg4300.txt
-rw-r--r-- 3 hduser supergroup 1423801 2011-03-10 11:38 /user/hduser/gutenberg/pg5000.txt
hduser@ubuntu:/usr/local/hadoop$

Run the MapReduce job

Now, we actually run the WordCount example job.

hduser@ubuntu:/usr/local/hadoop$ bin/hadoop jar hadoop*examples*.jar wordcount /user/hduser/gutenberg /user/hduser/gutenberg-output

This command will read all the files in the HDFS directory /user/hduser/gutenberg, process them, and store the result in the HDFS directory /user/hduser/gutenberg-output.

Example output of the previous command in the console:

hduser@ubuntu:/usr/local/hadoop$ bin/hadoop jar hadoop*examples*.jar wordcount /user/hduser/gutenberg /user/hduser/gutenberg-output
10/05/08 17:43:00 INFO input.FileInputFormat: Total input paths to process : 3
10/05/08 17:43:01 INFO mapred.JobClient: Running job: job_201005081732_0001
10/05/08 17:43:02 INFO mapred.JobClient: map 0% reduce 0%
10/05/08 17:43:14 INFO mapred.JobClient: map 66% reduce 0%
10/05/08 17:43:17 INFO mapred.JobClient: map 100% reduce 0%
10/05/08 17:43:26 INFO mapred.JobClient: map 100% reduce 100%
10/05/08 17:43:28 INFO mapred.JobClient: Job complete: job_201005081732_0001
10/05/08 17:43:28 INFO mapred.JobClient: Counters: 17
10/05/08 17:43:28 INFO mapred.JobClient: Job Counters
10/05/08 17:43:28 INFO mapred.JobClient: Launched reduce tasks=1
10/05/08 17:43:28 INFO mapred.JobClient: Launched map tasks=3
10/05/08 17:43:28 INFO mapred.JobClient: Data-local map tasks=3
10/05/08 17:43:28 INFO mapred.JobClient: FileSystemCounters
10/05/08 17:43:28 INFO mapred.JobClient: FILE_BYTES_READ=2214026
10/05/08 17:43:28 INFO mapred.JobClient: HDFS_BYTES_READ=3639512
10/05/08 17:43:28 INFO mapred.JobClient: FILE_BYTES_WRITTEN=3687918
10/05/08 17:43:28 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=880330
10/05/08 17:43:28 INFO mapred.JobClient: Map-Reduce Framework
10/05/08 17:43:28 INFO mapred.JobClient: Reduce input groups=82290
10/05/08 17:43:28 INFO mapred.JobClient: Combine output records=102286
10/05/08 17:43:28 INFO mapred.JobClient: Map input records=77934
10/05/08 17:43:28 INFO mapred.JobClient: Reduce shuffle bytes=1473796
10/05/08 17:43:28 INFO mapred.JobClient: Reduce output records=82290
10/05/08 17:43:28 INFO mapred.JobClient: Spilled Records=255874
10/05/08 17:43:28 INFO mapred.JobClient: Map output bytes=6076267
10/05/08 17:43:28 INFO mapred.JobClient: Combine input records=629187
10/05/08 17:43:28 INFO mapred.JobClient: Map output records=629187
10/05/08 17:43:28 INFO mapred.JobClient: Reduce input records=102286

Check if the result is successfully stored in HDFS directory /user/hduser/gutenberg-output:

hduser@ubuntu:/usr/local/hadoop$ bin/hadoop dfs -ls /user/hduser
Found 2 items
drwxr-xr-x - hduser supergroup 0 2010-05-08 17:40 /user/hduser/gutenberg
drwxr-xr-x - hduser supergroup 0 2010-05-08 17:43 /user/hduser/gutenberg-output
hduser@ubuntu:/usr/local/hadoop$ bin/hadoop dfs -ls /user/hduser/gutenberg-output
Found 2 items
drwxr-xr-x - hduser supergroup 0 2010-05-08 17:43 /user/hduser/gutenberg-output/_logs
-rw-r--r-- 1 hduser supergroup 880802 2010-05-08 17:43 /user/hduser/gutenberg-output/part-r-00000
hduser@ubuntu:/usr/local/hadoop$

Retrieve the job result from HDFS

To inspect the file, you can copy it from HDFS to the local file system. Alternatively, you can use the command

hduser@ubuntu:/usr/local/hadoop$ bin/hadoop dfs -cat /user/hduser/gutenberg-output/part-r-00000

to read the file directly from HDFS without copying it to the local file system.

hduser@ubuntu:/usr/local/hadoop$ mkdir /tmp/gutenberg-output
hduser@ubuntu:/usr/local/hadoop$ bin/hadoop dfs -getmerge /user/hduser/gutenberg-output /tmp/gutenberg-output
hduser@ubuntu:/usr/local/hadoop$ head /tmp/gutenberg-output/gutenberg-output
"(Lo)cra" 1
"1490 1
"1498," 1
"35" 1
"40," 1
"A 2
"AS-IS". 1
"A_ 1
"Absoluti 1
"Alack! 1
hduser@ubuntu:/usr/local/hadoop$

Hadoop Web Interfaces

Hadoop comes with several web interfaces which are by default (see conf/hadoop-default.xml) available at these locations:

http://localhost:50070/ – web UI of the NameNode daemon
http://localhost:50030/ – web UI of the JobTracker daemon
http://localhost:50060/ – web UI of the TaskTracker daemon

These web interfaces provide concise information about what’s happening in your Hadoop cluster. You might want to give them a try.

NameNode Web Interface (HDFS layer)

The name node web UI shows you a cluster summary including information about total/remaining capacity, live and dead nodes. Additionally, it allows you to browse the HDFS namespace and view the contents of its files in the web browser. It also gives access to the local machine’s Hadoop log files.

By default, it’s available at http://localhost:50070/.


Figure 4.2 NameNode web interface

JobTracker Web Interface (MapReduce layer)

The JobTracker web UI provides information about general job statistics of the Hadoop cluster, running/completed/failed jobs and a job history log file. It also gives access to the local machine's Hadoop log files (the machine on which the web UI is running).

By default, it’s available at http://localhost:50030/.


Figure 4.3 Job tracker web interface

TaskTracker Web Interface (MapReduce layer)

The task tracker web UI shows you running and non-running tasks. It also gives access to the local machine's Hadoop log files.

By default, it’s available at http://localhost:50060/.


Figure 4.4 Task tracker web interface

4.1.8 Setting up a multi cluster computing using Hadoop:

From two single-node clusters to a multi-node cluster – we will build a multi-node cluster using two Ubuntu boxes. The best way to do this for starters is to install, configure and test a "local" Hadoop setup on each of the two Ubuntu boxes, and in a second step to "merge" these two single-node clusters into one multi-node cluster, in which one Ubuntu box will become the designated master (but also act as a slave with regard to data storage and processing) and the other box will become only a slave. It is much easier to track down any problems you might encounter thanks to the reduced complexity of doing a single-node cluster setup first on each machine.


Figure 4.5 Multi node cluster components

Prerequisites:

First, the single-node cluster must be installed and configured on both machines as shown before. It is recommended to use the same installation settings and folder paths, because the two machines will be merged. One of the Ubuntu boxes will act as master and slave simultaneously, and the other box will act as a slave only.

Network configuration:

It should hardly come as a surprise, but for the sake of completeness I have to point out that both machines must be able to reach each other over the network. The easiest way is to put both machines in the same network with regard to hardware and software configuration. To make it simple, we will assign the IP address 192.168.0.1 to the master machine and 192.168.0.2 to the slave machine.


SSH configuration:

The hduser user on the master (hduser@master) must be able to connect:

a. To its own user account on the master
b. To the hduser user account on the slave (hduser@slave) via a password-less SSH login.

You just have to add hduser@master's public SSH key (which should be in $HOME/.ssh/id_rsa.pub) to the authorized_keys file of hduser@slave (in this user's $HOME/.ssh/authorized_keys). You can do this manually or use the following SSH command:

hduser@master:~$ ssh-copy-id -i $HOME/.ssh/id_rsa.pub hduser@slave

The final step is to test the SSH setup by connecting with user hduser from the master to the user account hduser on the slave. The step is also needed to save slave’s host key fingerprint to the hduser@master’s known_hosts file.

So, connecting from master to master…

hduser@master:~$ ssh master
The authenticity of host 'master (192.168.0.1)' can't be established.
RSA key fingerprint is 3b:21:b3:c0:21:5c:7c:54:2f:1e:2d:96:79:eb:7f:95.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'master' (RSA) to the list of known hosts.
Linux master 2.6.20-16-386 #2 Thu Jun 7 20:16:13 UTC 2007 i686
...
hduser@master:~$

And from master to slave.

hduser@master:~$ ssh slave
The authenticity of host 'slave (192.168.0.2)' can't be established.
RSA key fingerprint is 74:d7:61:86:db:86:8f:31:90:9c:68:b0:13:88:52:72.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'slave' (RSA) to the list of known hosts.
Ubuntu 10.04
...
hduser@slave:~$

Hadoop configuration:

We will prepare one Ubuntu box to act as both master and slave, and the other box to act as a slave only.

Figure 4.6 multi node cluster layers

The master node will run the “master” daemons for each layer: Name Node for the HDFS storage layer and Job Tracker for the Map Reduce processing layer. Both machines will run the “slave” daemons: Data Node for the HDFS layer and Task Tracker for Map Reduce processing layer. Basically, the “master” daemons are responsible for coordination and management of the “slave” daemons while the latter will do the actual data storage and data processing work.


Configuration

Conf/masters (master only)

One machine will act as the master, running the NameNode and the JobTracker; the remaining machines, however many there are, will act as slaves, running a DataNode and a TaskTracker.

On master, update conf/masters so that it looks like this:

Conf/masters (on master)

master

Conf/slaves (master only):

The conf/slaves file lists the hosts, one per line, where the Hadoop slave daemons (Data Nodes and Task Trackers) will be run. We want both the master box and the slave box to act as Hadoop slaves because we want both of them to store and process data.

On master, update conf/slaves so that it looks like this:

Conf/slaves (on master)

master
slave

If you have additional slave nodes, just add them to the conf/slaves file, one hostname per line.

Conf/slaves (on master)

master
slave
anotherslave01
anotherslave02
anotherslave03

Conf/*-site.xml (all machines)

You must change the configuration files conf/core-site.xml, conf/mapred-site.xml and conf/hdfs-site.xml on ALL machines as follows.


First, we have to change the fs.default.name parameter (in conf/core-site.xml), which specifies the Name Node (the HDFS master) host and port. In our case, this is the master machine.

Conf/core-site.xml (ALL machines)

<property>
  <name>fs.default.name</name>
  <value>hdfs://master:54310</value>
  <description>The name of the default file system. A URI whose
  scheme and authority determine the FileSystem implementation. The
  uri's scheme determines the config property (fs.SCHEME.impl) naming
  the FileSystem implementation class. The uri's authority is used to
  determine the host, port, etc. for a filesystem.</description>
</property>

Second, we have to change the mapred.job.tracker parameter (in conf/mapred-site.xml), which specifies the JobTracker (MapReduce master) host and port. Again, this is the master in our case.

Conf/mapred-site.xml (ALL machines)

<property>
  <name>mapred.job.tracker</name>
  <value>master:54311</value>
  <description>The host and port that the MapReduce job tracker runs
  at. If "local", then jobs are run in-process as a single map
  and reduce task.
  </description>
</property>

Third, we change the dfs.replication parameter (in conf/hdfs-site.xml) which specifies the default block replication. It defines how many machines a single file should be replicated to before it becomes available. If you set this to a value higher than the number of available slave nodes (more precisely, the number of DataNodes), you will start seeing a lot of “(Zero targets found, forbidden1.size=1)” type errors in the log files.

The default value of dfs.replication is 3. However, we have only two nodes available, so we set dfs.replication to 2.

Conf/hdfs-site.xml (ALL machines)


<property>
  <name>dfs.replication</name>
  <value>2</value>
  <description>Default block replication.
  The actual number of replications can be specified when the file is created.
  The default is used if replication is not specified in create time.
  </description>
</property>

Formatting the HDFS file system via the Name Node

Before we start our new multi-node cluster, we must format Hadoop’s distributed file system (HDFS) via the NameNode. You need to do this the first time you set up a Hadoop cluster.

Warning: Do not format a running cluster because this will erase all existing data in the HDFS file system!

To format the file system (which simply initializes the directory specified by the dfs.name.dir variable on the Name Node), run the command

Format the cluster’s HDFS file system

hduser@master:/usr/local/hadoop$ bin/hadoop namenode -format
... INFO dfs.Storage: Storage directory /app/hadoop/tmp/dfs/name has been successfully formatted.
hduser@master:/usr/local/hadoop$

Starting the multi-node cluster

Starting the cluster is performed in two steps.

1. We begin with starting the HDFS daemons: the NameNode daemon is started on master, and DataNode daemons are started on all slaves (here: master and slave).

2. Then we start the MapReduce daemons: the JobTracker is started on master, and TaskTracker daemons are started on all slaves (here: master and slave).


HDFS daemons

Run the command bin/start-dfs.sh on the machine you want the (primary) NameNode to run on. This will bring up HDFS with the NameNode running on the machine you ran the previous command on, and DataNodes on the machines listed in the conf/slaves file.

In our case, we will run bin/start-dfs.sh on master:

Start the HDFS layer

hduser@master:/usr/local/hadoop$ bin/start-dfs.sh
starting namenode, logging to /usr/local/hadoop/bin/../logs/hadoop-hduser-namenode-master.out
slave: Ubuntu 10.04
slave: starting datanode, logging to /usr/local/hadoop/bin/../logs/hadoop-hduser-datanode-slave.out
master: starting datanode, logging to /usr/local/hadoop/bin/../logs/hadoop-hduser-datanode-master.out
master: starting secondarynamenode, logging to /usr/local/hadoop/bin/../logs/hadoop-hduser-secondarynamenode-master.out
hduser@master:/usr/local/hadoop$

On slave, you can examine the success or failure of this command by inspecting the log file logs/hadoop-hduser-datanode-slave.log.

Example output:

... INFO org.apache.hadoop.dfs.Storage: Storage directory /app/hadoop/tmp/dfs/data is not formatted.
... INFO org.apache.hadoop.dfs.Storage: Formatting ...
... INFO org.apache.hadoop.dfs.DataNode: Opened server at 50010
... INFO org.mortbay.util.Credential: Checking Resource aliases
... INFO org.mortbay.http.HttpServer: Version Jetty/5.1.4
... INFO org.mortbay.util.Container: Started org.mortbay.jetty.servlet.WebApplicationHandler@17a8a02
... INFO org.mortbay.util.Container: Started WebApplicationContext[/,/]
... INFO org.mortbay.util.Container: Started HttpContext[/logs,/logs]
... INFO org.mortbay.util.Container: Started HttpContext[/static,/static]
... INFO org.mortbay.http.SocketListener: Started SocketListener on 0.0.0.0:50075
... INFO org.mortbay.util.Container: Started org.mortbay.jetty.Server@56a499
... INFO org.apache.hadoop.dfs.DataNode: Starting DataNode in: FSDataset{dirpath='/app/hadoop/tmp/dfs/data/current'}
... INFO org.apache.hadoop.dfs.DataNode: using BLOCKREPORT_INTERVAL of 3538203msec

As you can see in slave’s output above, it will automatically format its storage directory (specified by the dfs.data.dir parameter) if it is not formatted already. It will also create the directory if it does not exist yet.

At this point, the following Java processes should run on master…

Java processes on master after starting HDFS daemons

hduser@master:/usr/local/hadoop$ jps
14799 NameNode
15314 Jps
14880 DataNode
14977 SecondaryNameNode
hduser@master:/usr/local/hadoop$

And the following on slave.

Java processes on slave after starting HDFS daemons

hduser@slave:/usr/local/hadoop$ jps
15183 DataNode
15616 Jps
hduser@slave:/usr/local/hadoop$

MapReduce daemons

Run the command bin/start-mapred.sh on the machine you want the JobTracker to run on. This will bring up the MapReduce cluster with the JobTracker running on the machine you ran the previous command on, and TaskTrackers on the machines listed in the conf/slaves file.

In our case, we will run bin/start-mapred.sh on master:

Start the MapReduce layer

hduser@master:/usr/local/hadoop$ bin/start-mapred.sh
starting jobtracker, logging to /usr/local/hadoop/bin/../logs/hadoop-hadoop-jobtracker-master.out
slave: Ubuntu 10.04
slave: starting tasktracker, logging to /usr/local/hadoop/bin/../logs/hadoop-hduser-tasktracker-slave.out
master: starting tasktracker, logging to /usr/local/hadoop/bin/../logs/hadoop-hduser-tasktracker-master.out
hduser@master:/usr/local/hadoop$

On slave, you can examine the success or failure of this command by inspecting the log file logs/hadoop-hduser-tasktracker-slave.log. Example output:

... INFO org.mortbay.util.Credential: Checking Resource aliases
... INFO org.mortbay.http.HttpServer: Version Jetty/5.1.4
... INFO org.mortbay.util.Container: Started org.mortbay.jetty.servlet.WebApplicationHandler@d19bc8
... INFO org.mortbay.util.Container: Started WebApplicationContext[/,/]
... INFO org.mortbay.util.Container: Started HttpContext[/logs,/logs]
... INFO org.mortbay.util.Container: Started HttpContext[/static,/static]
... INFO org.mortbay.http.SocketListener: Started SocketListener on 0.0.0.0:50060
... INFO org.mortbay.util.Container: Started org.mortbay.jetty.Server@1e63e3d
... INFO org.apache.hadoop.ipc.Server: IPC Server listener on 50050: starting
... INFO org.apache.hadoop.ipc.Server: IPC Server handler 0 on 50050: starting
... INFO org.apache.hadoop.mapred.TaskTracker: TaskTracker up at: 50050
... INFO org.apache.hadoop.mapred.TaskTracker: Starting tracker tracker_slave:50050
... INFO org.apache.hadoop.ipc.Server: IPC Server handler 1 on 50050: starting
... INFO org.apache.hadoop.mapred.TaskTracker: Starting thread: Map-events fetcher for all reduce tasks on tracker_slave:50050

At this point, the following Java processes should run on master…

Java processes on master after starting MapReduce daemons

hduser@master:/usr/local/hadoop$ jps
16017 Jps
14799 NameNode
15686 TaskTracker
14880 DataNode
15596 JobTracker
14977 SecondaryNameNode
hduser@master:/usr/local/hadoop$

And the following on slave.

Java processes on slave after starting MapReduce daemons

hduser@slave:/usr/local/hadoop$ jps
15183 DataNode
15897 TaskTracker
16284 Jps
hduser@slave:/usr/local/hadoop$

Stopping the multi-node cluster

Like starting the cluster, stopping it is done in two steps. The workflow however is the opposite of starting.

1. We begin with stopping the MapReduce daemons: the JobTracker is stopped on master, and TaskTracker daemons are stopped on all slaves (here: master and slave).

2. Then we stop the HDFS daemons: the NameNode daemon is stopped on master, and DataNode daemons are stopped on all slaves (here: master and slave).

MapReduce daemons

Run the command bin/stop-mapred.sh on the JobTracker machine. This will shut down the MapReduce cluster by stopping the JobTracker daemon running on the machine you ran the previous command on, and TaskTrackers on the machines listed in the conf/slaves file.

In our case, we will run bin/stop-mapred.sh on master:

Stopping the MapReduce layer

hduser@master:/usr/local/hadoop$ bin/stop-mapred.sh
stopping jobtracker
slave: Ubuntu 10.04
master: stopping tasktracker
slave: stopping tasktracker
hduser@master:/usr/local/hadoop$

Note: The output above might suggest that the JobTracker was running and stopped on

“slave“, but you can be assured that the JobTracker ran on “master“.

At this point, the following Java processes should run on master…

Java processes on master after stopping MapReduce daemons

hduser@master:/usr/local/hadoop$ jps
14799 NameNode
18386 Jps
14880 DataNode
14977 SecondaryNameNode
hduser@master:/usr/local/hadoop$

And the following on slave.

Java processes on slave after stopping MapReduce daemons

hduser@slave:/usr/local/hadoop$ jps
15183 DataNode
18636 Jps
hduser@slave:/usr/local/hadoop$

HDFS daemons

Run the command bin/stop-dfs.sh on the NameNode machine. This will shut down HDFS by stopping the NameNode daemon running on the machine you ran the previous command on, and DataNodes on the machines listed in the conf/slaves file.

In our case, we will run bin/stop-dfs.sh on master:

Stopping the HDFS layer

hduser@master:/usr/local/hadoop$ bin/stop-dfs.sh
stopping namenode
slave: Ubuntu 10.04
slave: stopping datanode
master: stopping datanode
master: stopping secondarynamenode
hduser@master:/usr/local/hadoop$

(Again, the output above might suggest that the NameNode was running and stopped on slave, but you can be assured that the NameNode ran on master)

At this point, only the following Java processes should run on master…

Java processes on master after stopping HDFS daemons

hduser@master:/usr/local/hadoop$ jps
18670 Jps
hduser@master:/usr/local/hadoop$

And the following on slave.

Java processes on slave after stopping HDFS daemons

hduser@slave:/usr/local/hadoop$ jps
18894 Jps
hduser@slave:/usr/local/hadoop$


4.2 R

4.2.1 What is R?

R is a free software programming language and a software environment

for statistical computing and graphics. The R language is widely used among

statisticians and data miners for developing statistical software and data analysis. Polls

and surveys of data miners are showing R's popularity has increased substantially in

recent years.

R is an implementation of the S programming language combined with lexical

scoping semantics inspired by Scheme. S was created by John Chambers while at Bell

Labs. R was created by Ross Ihaka and Robert Gentleman at the University of

Auckland, New Zealand, and is currently developed by the R Development Core Team,

of which Chambers is a member. R is named partly after the first names of the first two

R authors and partly as a play on the name of S.

R is a GNU project. The source code for the R software environment is written

primarily in C, Fortran, and R. R is freely available under the GNU General Public

License, and pre-compiled binary versions are provided for various operating systems.

R uses a command line interface; however, several graphical user interfaces are

available for use with R.

4.2.2 Statistical features

R provides a wide variety of statistical and graphical techniques, including linear and nonlinear modeling, classical statistical tests, time-series analysis,

classification, clustering, and others. R is easily extensible through functions and

extensions, and the R community is noted for its active contributions in terms of

packages. There are some important differences, but much code written for S runs

unaltered. Many of R's standard functions are written in R itself, which makes it easy for

users to follow the algorithmic choices made. For computationally intensive

tasks, C, C++, and Fortran code can be linked and called at run time. Advanced users

can write C or Java code to manipulate R objects directly.

R is highly extensible through the use of user-submitted packages for specific

functions or specific areas of study. Due to its S heritage, R has stronger object-oriented

programming facilities than most statistical computing languages. Extending R is also

eased by its lexical scoping rules.

Another strength of R is static graphics, which can produce publication-quality graphs,

including mathematical symbols. Dynamic and interactive graphics are available

through additional packages.

R has its own LaTeX-like documentation format, which is used to supply comprehensive

documentation, both on-line in a number of formats and in hard copy.


4.2.3 The R environment

R is an integrated suite of software facilities for data manipulation, calculation and graphical display. It includes:

An effective data handling and storage facility,
A suite of operators for calculations on arrays, in particular matrices,
A large, coherent, integrated collection of intermediate tools for data analysis,
Graphical facilities for data analysis and display either on-screen or on hardcopy, and
A well-developed, simple and effective programming language which includes conditionals, loops, user-defined recursive functions and input and output facilities (a small sketch of these language features follows below).
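To illustrate the last point, the short sketch below (our own illustration, using only base R and no extra packages) shows a conditional, a loop, a user-defined recursive function, and simple console output:

# A user-defined recursive function: factorial of n
fact <- function(n) {
  if (n <= 1) {                     # conditional
    return(1)
  } else {
    return(n * fact(n - 1))         # recursive call
  }
}

# A loop with simple output to the console
for (i in 1:5) {
  cat("factorial of", i, "is", fact(i), "\n")
}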

4.2.4 Basic syntax of the language

> x <- c(1,2,3,4,5,6)    # Create ordered collection (vector)
> y <- x^2               # Square the elements of x
> print(y)               # Print (vector) y
[1]  1  4  9 16 25 36
> mean(y)                # Calculate average (arithmetic mean) of (vector) y; result is scalar
[1] 15.16667
> var(y)                 # Calculate sample variance
[1] 178.9667
> lm_1 <- lm(y ~ x)      # Fit a linear regression model "y = f(x)" or "y = B0 + (B1 * x)"
                         # and store the results as lm_1
> print(lm_1)            # Print the model from the (linear model object) lm_1

Call:
lm(formula = y ~ x)

Coefficients:
(Intercept)            x
     -9.333        7.000

> summary(lm_1)          # Compute and print statistics for the fit
                         # of the (linear model object) lm_1

Call:
lm(formula = y ~ x)

Residuals:
      1       2       3       4       5       6
 3.3333 -0.6667 -2.6667 -2.6667 -0.6667  3.3333

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  -9.3333     2.8441  -3.282 0.030453 *
x             7.0000     0.7303   9.585 0.000662 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.055 on 4 degrees of freedom
Multiple R-squared: 0.9583,    Adjusted R-squared: 0.9478
F-statistic: 91.88 on 1 and 4 DF,  p-value: 0.000662

> par(mfrow=c(2, 2))     # Request 2x2 plot layout
> plot(lm_1)             # Diagnostic plot of regression model

4.2.5 Packages

The capabilities of R are extended through user-created packages, which allow

specialized statistical techniques, graphical devices, import/export capabilities, reporting

tools, etc. These packages are developed primarily in R, and sometimes

in Java, C and Fortran. A core set of packages is included with the installation of R, with

5300 additional packages (as of April 2012) available at the Comprehensive R Archive

Network (CRAN), Bioconductor, and other repositories.
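As a brief illustration (the package name ggplot2 is just an arbitrary example of a CRAN package, not something this project depends on), installing and loading a CRAN package from within R looks like this:

# Install a package from CRAN into the local R library
install.packages("ggplot2")

# Load the installed package into the current R session
library(ggplot2)

# List the packages currently installed
rownames(installed.packages())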

The "Task Views" page (subject list) on the CRAN website lists the wide range of

applications (Finance, Genetics, Machine Learning, Medical Imaging, Social Sciences

and Spatial Statistics) to which R has been applied and for which packages are

available.

Other R package resources include Crantastic, a community site for rating and

reviewing all CRAN packages, and also R-Forge, a central platform for the collaborative

development of R packages, R-related software, and projects. It hosts many

unpublished, beta packages, and development versions of CRAN packages.

The Bioconductor project provides R packages for the analysis of genomic data,

such as Affymetrix and cDNA microarray object-oriented data-handling and analysis

tools, and has started to provide tools for analysis of data from next-generation high-

throughput sequencing methods.

Reproducible research and automated report generation can be accomplished

with packages that support execution of R code embedded

within LaTeX, OpenDocument format and other markups.


4.3 PostgreSQL

PostgreSQL, often simply Postgres, is an object-relational database

management system (ORDBMS) available for many platforms including Linux,

FreeBSD, Solaris, Microsoft Windows and Mac OS X. It is released under the

PostgreSQL License, which is an MIT-style license, and is thus free and open source

software. PostgreSQL is developed by the PostgreSQL Global Development Group,

consisting of a handful of volunteers employed and supervised by companies such

as Red Hat and EnterpriseDB. It implements the majority of

the SQL:2008 standard, is ACID-compliant, is fully transactional (including all DDL

statements), has extensible data types, operators, index methods, functions,

aggregates, procedural languages, and has a large number of extensions written by

third parties.

The vast majority of Linux distributions have PostgreSQL available in supplied

packages. Mac OS X, starting with Lion, has PostgreSQL server as its standard default

database in the server edition, and PostgreSQL client tools in the desktop edition.

4.3.1 Connecting R with PostgreSQL using DBI

RPostgreSQL provides a DBI-compliant database connection from GNU R to PostgreSQL. Development of RPostgreSQL was supported via the Google Summer of Code 2008 program. The package is now available on the CRAN mirror network and can be installed via install.packages() from within R. We use it to retrieve data from the PostgreSQL store into R in order to run MapReduce jobs and statistical analysis on the data.

library(RPostgreSQL)

## loads the PostgreSQL driver

drv <- dbDriver("PostgreSQL")

## Open a connection

con <- dbConnect(drv, dbname="R_Project")

## Submits a statement

rs <- dbSendQuery(con, "select * from R_Users")

## fetch all elements from the result set

fetch(rs,n=-1)

## Submit and execute the query

dbGetQuery(con, "select * from R_packages")

## Closes the connection


dbDisconnect(con)
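Results can also be written back to PostgreSQL before the connection is closed; a minimal sketch, assuming a hypothetical word_counts table:

## Write an R data frame back into PostgreSQL (the table is created if it does not exist)
results <- data.frame(word = c("hadoop", "r"), count = c(10, 7))
dbWriteTable(con, "word_counts", results)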

4.4 HBase

HBase is an open source, non-relational, distributed database modeled after Google's BigTable and is written in Java. It is developed as part of Apache Software Foundation's Apache Hadoop project and runs on top of HDFS (Hadoop Distributed Filesystem), providing BigTable-like capabilities for Hadoop. That is, it provides a fault-tolerant way of storing large quantities of sparse data.

HBase features compression, in-memory operation, and Bloom filters on a per-column basis as outlined in the original BigTable paper. Tables in HBase can serve as the input and output for MapReduce jobs run in Hadoop, and may be accessed through the Java API but also through REST, Avro or Thrift gateway APIs.

HBase is not a direct replacement for a classic SQL database, although recently its performance has improved, and it is now serving several data-driven websites, including Facebook's Messaging Platform. In the parlance of Eric Brewer’s CAP theorem, HBase is a CP type system.

4.4.1 Connect R with HBase Using RHbase

This R package provides basic connectivity to HBase, using the Thrift server. R programmers can browse, read, write, and modify tables stored in HBase. The following functions are part of this package:

• Table manipulation: hb.new.table, hb.delete.table, hb.describe.table, hb.set.table.mode, hb.regions.table
• Read/write: hb.insert, hb.get, hb.delete, hb.insert.data.frame, hb.get.data.frame, hb.scan, hb.scan.ex
• Utility: hb.list.tables
• Initialization: hb.defaults, hb.init
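A minimal sketch of how these functions fit together once HBase and its Thrift server (described below) are running; the table name, column family and exact argument structure are assumptions and may differ between rhbase versions:

library(rhbase)
hb.init(host="127.0.0.1", port=9090)                            ## connect to the HBase Thrift server
hb.list.tables()                                                ## list the existing tables
hb.new.table("test_table", "cf")                                ## create a table with one column family
hb.insert("test_table", list(list("row1", "cf:count", "42")))   ## write one cell
hb.get("test_table", "row1")                                    ## read the row back
hb.delete.table("test_table")                                   ## drop the test table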

Prerequisites

Installing the package requires that you first install and build Thrift. Once you have the libraries built, be sure they are in a path where the R client can find them (e.g. /usr/lib). This package was built and tested using Thrift 0.8.


Here is an example for building the libraries on CentOS:

1. Install all Thrift pre-requisites: http://wiki.apache.org/thrift/GettingCentOS5Packages

2. Build Thrift according to the instructions: http://wiki.apache.org/thrift/ThriftInstallation

3. Update PKG_CONFIG_PATH: export PKG_CONFIG_PATH=$PKG_CONFIG_PATH:/usr/local/lib/pkgconfig/

4. Verify the pkg-config path is correct: pkg-config --cflags thrift should return -I/usr/local/include/thrift

5. Copy the Thrift library: sudo cp /usr/local/lib/libthrift-0.8.0.so /usr/lib/

The Thrift server listens on port 9090 by default and is started with: [hbase-root]/bin/hbase thrift start

hb.init(host="127.0.0.1", port=9090)

4.5 RHadoop: Connecting R with Hadoop using rhdfs

The 'Big Data' explosion of the last few years has led to new infrastructure investments around storage and data architectures. Apache Hadoop has rapidly become a leading option for storing and performing operations on big data. Meanwhile, R has emerged as the tool of choice for data scientists modeling and running advanced analytics. Revolution Analytics brings R to Hadoop, giving companies a way to get better returns on their big data investments and extract unique, competitive insights from advanced analytics with the most cost-effective solution on the market. R users can:

• Interface directly with the HDFS filesystem from R.
• Import big-data tables into R from Hadoop filestores via Apache HBase.
• Create big-data analytics by writing MapReduce tasks directly in the R language.

4.5.1 Using R with Hadoop

You can access data stored in Hadoop HDFS, as well as run MapReduce jobs, using standard R functions. Reading and writing are supported via the pipe() command, which either receives information from R or transmits data to R. A MapReduce job can be run synchronously or asynchronously as a shell script; both stdout and stderr can be redirected and captured if required.
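A sketch of what this looks like in practice; the HDFS paths and the examples jar name are assumptions reused from elsewhere in this chapter:

## Read a file stored in HDFS into R through a pipe
con_in <- pipe("hadoop fs -cat /user/input_folder/temp")
lines <- readLines(con_in)
close(con_in)

## Write data from R back into HDFS through a pipe ("-" makes put read from stdin)
con_out <- pipe("hadoop fs -put - /user/input_folder/from_r.txt")
writeLines(lines, con_out)
close(con_out)

## Run a MapReduce job as a shell command and capture its console output
log <- system("hadoop jar hadoop-examples.jar wordcount /user/input_folder /user/wc_out",
              intern = TRUE)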


Revolution Analytics has released a set of R packages called RHadoop that allow direct access to Hadoop HDFS, HBase, and MapReduce. The HBase and HDFS interfaces provide a number of access functions; the goal of the project was to mirror the Java interfaces as much as possible. mapreduce() is the exception: the input and output directories are specified as would be done with a Hadoop MapReduce job, but the Map function and the Reduce function are both written in R.

The rhdfs functions are:

• File manipulation: hdfs.copy, hdfs.move, hdfs.rename, hdfs.delete, hdfs.rm, hdfs.del, hdfs.chown, hdfs.put, hdfs.get
• File read/write: hdfs.file, hdfs.write, hdfs.close, hdfs.flush, hdfs.read, hdfs.seek, hdfs.tell, hdfs.line.reader, hdfs.read.text.file
• Directory: hdfs.dircreate, hdfs.mkdir
• Utility: hdfs.ls, hdfs.list.files, hdfs.file.info, hdfs.exists
• Initialization: hdfs.init, hdfs.defaults
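A short sketch of typical rhdfs usage; the paths reuse the example directories from later in this chapter, and the reader interface may vary slightly between rhdfs versions:

library(rhdfs)
hdfs.init()                                          ## initialise the connection to HDFS
hdfs.ls("/")                                         ## list the HDFS root directory
hdfs.mkdir("/user/r_test")                           ## create a working directory
hdfs.put("/home/user/Desktop/temp", "/user/r_test")  ## copy a local file into HDFS
hdfs.exists("/user/r_test/temp")                     ## check that the copy succeeded
reader <- hdfs.line.reader("/user/r_test/temp")      ## stream the file line by line
head(reader$read())                                  ## read (and show) the first batch of lines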

The HBase functions are:

• Table manipulation: hb.new.table, hb.delete.table, hb.describe.table, hb.set.table.mode, hb.regions.table
• Row read/write: hb.insert, hb.get, hb.delete, hb.insert.data.frame, hb.get.data.frame, hb.scan
• Utility: hb.list.tables
• Initialization: hb.defaults, hb.init


Putting it all Together

Figure 4.7 RHadoop components

(The diagram shows a developer working in R, connected to Postgres and HBase, submitting Map and Reduce jobs that the TaskTrackers execute over a large data set such as log files or sensor data.)


4.5.2 Connecting Hadoop with R “RHadoop”:

RHadoop is a bridge between R, a language and environment for statistically exploring data sets, and Hadoop, a framework that allows the distributed processing of large data sets across clusters of computers. RHadoop is built out of three R packages: rmr, rhdfs and rhbase. Below, we present each of these packages and cover their installation and basic usage.

4.5.3 RMR

The rmr package offers Hadoop MapReduce functionality in R. For Hadoop users, writing MapReduce programs in R may be considered easier, more productive and more elegant, with much less code than in Java and easier deployment; it is well suited to prototyping and research. For R users, it opens the doors of MapReduce programming and access to big data analysis.

The rmr package must not be seen as Hadoop streaming, even if internally it uses the streaming architecture. You can do Hadoop streaming with R without any of these packages, since the language supports stdin and stdout access. Also, rmr programs are not meant to be more efficient than those written in Java or other languages.

Finally:

rmr does not provide a MapReduce version of any of the more than 3,000 packages available for R, and it does not solve the problem of parallel programming: you still have to design parallel algorithms for any problem you need to solve, but you can focus on the interesting aspects. Some problems are believed not to be amenable to a parallel solution, and using the MapReduce paradigm or rmr does not change that.

4.5.4 Rhdfs

The rhdfs package offers basic connectivity to the Hadoop Distributed File System. It comes with convenient functions to browse, read, write, and modify files stored in HDFS.

4.5.5 Rhbase

The rhbase package offers basic connectivity to HBase. It comes with convenient functions to browse, read, write, and modify tables stored in HBASE.


4.5.6 R Installation & configuration:

You must have a working installation of Hadoop at your disposal. The setup is recommended for, and was tested with, the Cloudera CDH3 distribution; consult the RHadoop wiki for alternative installations and future evolution. At the time of this writing, the Cloudera CDH4 distribution is not yet compatible and is documented as a work in progress. All the common Hadoop services must be started, as well as the HBase Thrift server if you want to test rhbase.

Below, the Chef recipes from the original tutorial have been translated into shell commands.

Thrift dependency

Note that Maven (apt-get install maven2) might also be required.

Figure 4.8 Thrift dependency


R installation, environment and package dependencies:

Figure 4.9 install Rbase


RMR

Figure 4.10 Installing RMR package

Rhdfs

Figure 4.11 Installing Rhdfs Package

Rhbase

Figure 4.12 Installing Rhbase package


Usage

We are now ready to test our installation. Let's use the second example from the RHadoop wiki tutorial. This example starts with a standard R script which generates a list of values and counts their occurrences:

Figure 4.13 Standard R script 1

It then translates that script into a scalable MapReduce script:

Figure 4.14 Converting R script into Mapreduce
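The two figures above come from the RHadoop wiki tutorial. A sketch of roughly what they contain, written against the rmr 1.x API that was current at the time (function and argument names may differ in later rmr releases):

library(rmr)

## Plain R: generate random group labels and count how often each label occurs
groups <- rbinom(n = 50, size = 32, prob = 0.4)
tapply(groups, groups, length)

## The same computation expressed as a MapReduce job on Hadoop
groups.dfs <- to.dfs(groups)                              ## push the data into HDFS
out <- mapreduce(input  = groups.dfs,
                 map    = function(k, v) keyval(v, 1),    ## emit (label, 1) pairs
                 reduce = function(k, vv) keyval(k, length(vv)))  ## count per label
from.dfs(out)                                             ## pull the result back into R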

The result is now stored inside the ‘/tmp’ folder of HDFS. Here are two commands to print the file path and the file content:

Figure 4.15 Results commands
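The exact commands are shown in Figure 4.15; from a shell, the equivalent inspection would look roughly like this (the name of the temporary output subdirectory is an assumption):

bin/hadoop dfs -ls /tmp                              ## locate the job's output directory
bin/hadoop dfs -cat /tmp/<output-dir>/part-00000     ## print the contents of the result file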

If you want to carry out big data analysis by implementing and running MapReduce jobs on a distributed system, this is the point to start from.


In the first step we configure and start the Hadoop distributed file system: the NameNode, secondary NameNode, DataNodes, JobTracker and TaskTrackers.

Figure 4.16 Starting Hadoop

In the second step, copy the files you want from the local directory into the Hadoop distributed file system (HDFS) with the following command:

bin/hadoop dfs -copyFromLocal /home/user/Desktop/temp /user/input_folder
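You can verify the copy from the command line as well, for example:

bin/hadoop dfs -ls /user/input_folder    ## list the files that were just copied into HDFS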


You can browse the files in HDFS through the Hadoop web interface.

Then we have to establish a connection between Hadoop and RStudio, where we will write and run the MapReduce job and visualize the resulting output. To do this we must install some R packages (courtesy of Revolution Analytics) that handle the connection:

o rhdfs – access to HDFS functions
o rhbase – access to HBase
o rmr – run MapReduce jobs
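Once installed, the packages are loaded inside RStudio and initialised before use; a minimal sketch (depending on the package versions, environment variables such as HADOOP_HOME or HADOOP_CMD may also need to point at the Hadoop installation):

library(rhdfs)    ## HDFS access functions (hdfs.*)
library(rhbase)   ## HBase access functions (hb.*)
library(rmr)      ## MapReduce from R (mapreduce, to.dfs, from.dfs)

hdfs.init()       ## must be called once before any other hdfs.* function
hb.init()         ## connects to the HBase Thrift server (host/port can be passed if needed)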


In the next step we implement a MapReduce job in R, placed on the master, that counts the number of occurrences of each word; the processing required for this job is divided into tasks which run on the slaves.

The MapReduce job runs through the rmr package using Hadoop streaming, which converts the R script into a form that Hadoop can work with.

Wait until the map and reduce phases are complete; this will take some time.

Figure 4.17 RStudio interface


After the MapReduce job completes, go to the output folder to see the word-count result; you can explore the file and download it from the Hadoop web interface.


Figure 4.18 Word count example output


Figure 4.19 MapReduce word count example


Figure 4.20 Data visualization using R

Now we can move to the last step: data visualization.

In this example we want to find the five most frequently repeated words in the data.


In the next graph the top five words are highlighted with red points.
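A sketch of how such a chart can be produced with base R graphics, assuming the word-count output has been read into a data frame with hypothetical columns word and count:

counts <- read.table("wordcount_output.txt",             ## file name is an assumption
                     col.names = c("word", "count"),
                     stringsAsFactors = FALSE)
counts <- counts[order(-counts$count), ]                 ## sort by frequency, descending
top5 <- head(counts, 5)

plot(counts$count, type = "h",
     xlab = "word rank", ylab = "occurrences", main = "Word frequencies")
points(1:5, top5$count, col = "red", pch = 19)           ## highlight the five most frequent words
text(1:5, top5$count, labels = top5$word, pos = 4)       ## label the highlighted points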


Big Data Analytics Case Study [Social Network #tags Analysis]:

We have two files: the first contains Facebook hash tags and the second contains Twitter hash tags. Every hash tag that people write on Facebook or Twitter is recorded in these files. What is the benefit of analyzing them?

The First file


The second file


After running the MapReduce job that implements word count on these input files, the output will be:


We visualize the results to find the most popular hash tags across the social networks; this is useful for ranking hash tags in the list of recommendations when a user searches for a specific hash tag.
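A small sketch of how the ranked recommendation list can be derived from the word-count output (the file and column names are assumptions):

tags <- read.table("hashtag_counts.txt",
                   col.names = c("hashtag", "count"),
                   stringsAsFactors = FALSE)
ranking <- tags[order(-tags$count), ]    ## most popular hash tags first
head(ranking, 10)                        ## top ten across Facebook and Twitter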


4.6 Developing the Graphical User Interface

1- This is the window that appears after running the GUI.

Figure 4.21 Window appeared after running GUI.

2- Press the "Run hadoop" button.

Figure 4.22 Window appeared during running GUI


3- The "Run hadoop" button is active.

Figure 4.23 Hadoop starts running when "Run hadoop" is pressed.

4- Hadoop is now running.

Figure 4.24 Hadoop is now running


5- Run a jar file and get its output (e.g. the word count example; an equivalent command-line invocation is sketched after the figure).

Figure 4.25 Running Word count example .
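The equivalent command-line invocation would look roughly like this; the jar file name matches the Hadoop 1.1.0 release mentioned later in this chapter, and the input/output paths reuse the earlier example (all three are assumptions):

bin/hadoop jar hadoop-examples-1.1.0.jar wordcount /user/input_folder /user/output_folder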

6- A new window opens to get the output file name.

Figure 4.26 New window opened for entering output file name.


7- After running the jar file, we call the JobTracker web interface to monitor the job's progress.

Figure 4.27 Calling the JobTracker web interface.
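In a default Hadoop 1.x installation the JobTracker web interface is served on port 50030 of the master node, so the page being opened is of this form (the host name is an assumption):

http://master:50030/jobtracker.jsp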


7.1- Mapping and reducing in progress for job ID job_201306211046_0002.

Figure 4.28 This snapshot shows the progress of job ID job_201306211046_0002.


7.2- The map/reduce job is completed (100% map and 100% reduce) for job ID job_201306211046_0002.

Figure 4.29 Completed MapReduce job


7.3- Press job ID to show its summary

Figure 4.30 press job ID link to show its summary.


7.4- Map/reduce job summary of job ID job_201306211046_0002

figure 4.31 job summary


8- We can show the progress of all TaskTrackers through the JobTracker web interface, as follows:

Figure 4.32 Showing the TaskTrackers.


9- Show all TaskTrackers in the cluster and select a TaskTracker to show its progress.

Figure 4.33 Links to all TaskTrackers.


10- The slave TaskTracker while processing job ID job_201306211046_0002.

Figure 4.34 Slave TaskTracker.


11- The progress of one slave TaskTracker while processing job ID job_201306211046_0002.

Figure 4.35 Slave TaskTracker progress.


12- The slave TaskTracker after completing its task.

Figure 4.36 Slave TaskTracker after completing its task.


13- The master TaskTracker while processing job ID job_201306211046_0002.

Figure 4.37 Master TaskTracker.


14- The progress of one master TaskTracker while processing job ID job_201306211046_0002.

Figure 4.38 Master TaskTracker progress.


15- The master TaskTracker after completing its job.

Figure 4.39 Master TaskTracker after completing its task.


16- We call the NameNode web interface to browse the file systems of both master and slaves.

Figure 4.40 Calling the NameNode web interface.
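In a default Hadoop 1.x installation the NameNode web interface listens on port 50070, so the page being opened is of this form (the host name is an assumption):

http://master:50070/dfshealth.jsp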


17- The NameNode interface for browsing the file systems of all nodes (master and slaves).

Figure 4.41 NameNode web interface.


18- Active nodes link

figure 4.42 active nodes link.


19- Active nodes

figure 4.43 browsing active nodes .


21.1- If we browse the master machine, press "User" to get the Hadoop user name on the master machine.

Figure 4.44 Browsing the master file system.


21.2- The user of the master node is called "dola"; press it to show the directory in which DFS stores its data.

Figure 4.45 Browsing the master user to show the file system.


21.3- The content of HDFS, which includes the output file called "Adel_awd_el-agawany" that was entered earlier.

Figure 4.46 Master file system content, including the output file name entered earlier.


21.4- Open the output file and show the output data.

figure 4.47 display output file content.


21.5- The output file content.

figure 4.48 display output file content


20- Call CMD to write any command you want

figure 4.49 calling CMD to write commands


21- Stopping hadoop

figure 4.50 click stop hadoop

22- Hadoop is stopped

figure 4.51 hadoop is stopped


Problems

1) We wanted to run Hadoop and control each node through the GUI.

We downloaded the hadoop-eclipse plugin (https://code.google.com/p/hadoop-eclipse-plugin/downloads/detail?name=hadoop-0.20.3-dev-eclipse-plugin.jar&can=2&q=), but it did not work. We also tried to build a new plugin (http://iredlof.com/part-4-compile-hadoop-v1-0-4-eclipse-plugin-on-ubuntu-12-10/), but that did not work either; we think this is because we used hadoop-1.1.0.

2) We tried to change the layout of the web-based output interface, but we could only change the CSS file.

3) While running a MapReduce job we got an error saying that the JobTracker is in safe mode.

The fix is to disable safe mode (http://stackoverflow.com/questions/7470646/hadoop-job-asks-to-disable-safe-node).
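Safe mode can also be left manually from the command line (it is actually HDFS that enters safe mode):

bin/hadoop dfsadmin -safemode leave    ## force the NameNode out of safe mode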

4) While running a MapReduce job we got a DataStreamer data-replication error.

Possible fixes are to disable the Ubuntu firewall (http://www.cyberciti.biz/faq/ubuntu-server-disable-firewall/), to reformat the NameNode (http://stackoverflow.com/questions/10097246/no-data-nodes-are-started), or to change the replication value from 1 to 2 in hdfs-site.xml:

<property>
  <name>dfs.replication</name>
  <value>2</value>
</property>

In our case it was solved by changing the permissions of the local folder on the desktop to the Hadoop user.

5) We tried to use RHEL 5.4, but we did not have an installation number to use yum and other packages, and there was no workaround.



6) Streaming command failed (package rmr)

Problem:

We have set up a single-node Hadoop cluster (v1.0.3) on Ubuntu and installed R (2.15) and the RHadoop packages rmr, rhdfs and rhbase. When we try to run the first tutorial program we run into errors:

22/6/13 11:49:20 ERROR streaming.StreamJob: Job not successful. Error: # of failed Map Tasks exceeded allowed limit. FailedCount: 1. LastFailedTask: task_201208071147_0001_m_000000
12/08/07 11:49:20 INFO streaming.StreamJob: killJob...
Streaming Command Failed!
Error in mr(map = map, reduce = reduce, combine = combine, in.folder = if (is.list(input)) { :
hadoop streaming failed with error code 1

Solution:

Reinstall rmr in a system directory rather than a user-specific directory: it could very well be that when R is started as part of Hadoop, its .libPaths() is more limited than when R is used interactively. Then check that the Hadoop streaming jar and the streaming setup work under the Hadoop environment.

How Streaming Works:

In Hadoop streaming, both the mapper and the reducer are executables that read the input from stdin (line by line) and emit the output to stdout. The utility creates a Map/Reduce job, submits the job to an appropriate cluster, and monitors the progress of the job until it completes.

When an executable is specified for mappers, each mapper task will launch the executable as a separate process when the mapper is initialized. As the mapper task runs, it converts its inputs into lines and feeds the lines to the stdin of the process. In the meantime, the mapper collects the line-oriented outputs from the stdout of the process and converts each line into a key/value pair, which is collected as the output of the mapper. By default, the prefix of a line up to the first tab character is the key and the rest of the line (excluding the tab character) is the value. If there is no tab character in the line, then the entire line is considered the key and the value is null. However, this can be customized, as discussed later.

When an executable is specified for reducers, each reducer task will launch the executable as a separate process when the reducer is initialized. As the reducer task runs, it



converts its input key/value pairs into lines and feeds the lines to the stdin of the process. In the meantime, the reducer collects the line-oriented outputs from the stdout of the process and converts each line into a key/value pair, which is collected as the output of the reducer. By default, the prefix of a line up to the first tab character is the key and the rest of the line (excluding the tab character) is the value. However, this can be customized, as discussed later.

This is the basis for the communication protocol between the Map/Reduce framework and the streaming mapper/reducer.

You can supply a Java class as the mapper and/or the reducer, for example:

$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
    -input myInputDirs \
    -output myOutputDir \
    -mapper org.apache.hadoop.mapred.lib.IdentityMapper \
    -reducer /bin/wc

Users can specify stream.non.zero.exit.is.failure as true or false to make a streaming task that exits with a non-zero status be treated as a failure or a success, respectively. By default, streaming tasks exiting with a non-zero status are considered failed tasks.
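The option is passed as a generic -D parameter before the other streaming options, for example:

$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
    -D stream.non.zero.exit.is.failure=false \
    -input myInputDirs \
    -output myOutputDir \
    -mapper org.apache.hadoop.mapred.lib.IdentityMapper \
    -reducer /bin/wc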


What is Next?

1) Change the block size to improve MapReduce jobs, and create more than one HDFS data directory so that data is saved in different directories instead of a single one, to avoid losing all the data in case of a hard disk failure (see the configuration sketch after this list).

2) Create a second master machine and cluster it with the base master machine to avoid a single point of failure. First we tried the Ubuntu MPICH package: we created the second machine on external USB storage, ran the job from it and then unplugged it, but it did not work. Second, we wanted the second master machine to be on a separate virtual server (VS) clustered with the VS that holds the first master virtual machine, so we used an ESXi 5 server, connected to it using the VMware vSphere client to add, remove and edit the virtual machines, used VMware vCenter Converter Standalone Client to convert the pre-defined virtual machines (master and slave machines) and upload them to the ESXi server, and used vCenter Server on Windows Server 2008 R2 to cluster the virtual ESXi hosts. However, because of a lack of resources we could not test it, and when we put the two master machines on the same server we could not test it either because of the slow response from the machines (http://www.sysadmintutorials.com/tutorials/vmware-vsphere-4/vcenter4/clustering-esx-hosts/).

3) Using ESXi, the vSphere client and vCenter Server gives us advantages and the power to make use of the server functions, so that our project is always available, scalable and recoverable, avoids a single point of failure, is easy to use, and lets us create a template from a machine to easily create new slave virtual machines with a pre-defined configuration.
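For point 1, the block size in Hadoop 1.x is controlled by the dfs.block.size property in hdfs-site.xml; a sketch of the change (the 128 MB value is only an example, the default being 64 MB):

<property>
  <name>dfs.block.size</name>
  <value>134217728</value>   <!-- 128 MB block size instead of the 64 MB default -->
</property>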



Conclusion

Finally, the big data problem was solved using Hadoop and the R programming language, and we tested the system on two files containing Facebook and Twitter hash tags. After running the MapReduce functions on the two files, the results appeared in a chart and a table containing the hash tag names and the number of occurrences of every hash tag in the whole file.



References

https://jeffreybreen.wordpress.com/2012/03/10/big-data-step-by-step-slides/

https://github.com/RevolutionAnalytics

http://blog.revolutionanalytics.com/2012/03/r-and-hadoop-step-by-step-tutorials.html

https://github.com/RevolutionAnalytics/RHadoop/issues/106

https://github.com/RevolutionAnalytics/RHadoop/issues/143

http://pokristensson.com/rplotting.html

http://hadoop.apache.org/docs/stable/streaming.html#Setting+Environment+Variables

http://www.meetup.com/Learning-Machine-Learning-by-Example/pages/Installing_R_and_RHadoop/

http://blogs.impetus.com/big_data/big_data_technologies/MapReduceWithRPart2.do

http://www.slideshare.net/jeffreybreen/big-data-stepbystep-using-r-hadoop-with-rhadoops-rmr-package

[PDF] http://www.mapr.com/Download-document/26-RHadoop-and-MapR

http://www.revolutionanalytics.com/products/r-for-apache-hadoop.php

http://www.adaltas.com/blog/2012/05/19/hadoop-and-r-is-rhadoop/

http://en.wikipedia.org/wiki/Secure_Shell

http://en.wikipedia.org/wiki/Computer_cluster

http://en.wikipedia.org/wiki/High-availability_cluster

http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-multi-node-cluster/

http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-single-node-cluster/

http://hadoop.apache.org/docs/stable/cluster_setup.html

http://harish11g.blogspot.com/2012/02/configuring-cloudera-hadoop-cluster.html

http://thefoglite.com/2012/07/05/create-and-administer-esxi-host-and-datastore-clusters/

http://vinf.net/2010/02/25/8-node-esxi-cluster-running-60-virtual-machines-all-running-from-a-single-500gbp-physical-server/


http://fscked.org/writings/clusters/

http://tldp.org/HOWTO/html_single/Beowulf-HOWTO/

http://www.sysadmintutorials.com/tutorials/vmware-vsphere-4/vcenter4/clustering-esx-hosts/

http://answers.oreilly.com/topic/1524-how-to-create-a-vmware-cluster/#adding_a_new_cluster_to_a_datacenter

http://communities.vmware.com/thread/229911?start=30&tstart=0

https://help.ubuntu.com/community/MpichCluster