Research on Big data Analytics for different domains

Jan 22, 2018

RESEARCH ON BIG DATA ANALYTICS

Name

Anmol Seth ([email protected])

Divya Mehta ([email protected])

Kalpi Savalia ([email protected])

Rajvi Thakkar ([email protected])

Ahmedabad University Big Data Analytics

Summer Internship

Index

Abstract
Keywords
Introduction
Literature survey
    Introduction to Big Data
    Common sources of big data
    Big data analytics
    Challenges in data processing
    Hadoop
    Hadoop vs. Conventional relational database
    Alternatives of Hadoop
    Why use Hadoop?
    Origin of Hadoop
    Hadoop Architecture
    Hadoop Daemons
    Hadoop has two main Components
    HDFS - Hadoop Distributed File System
    Functions of MapReduce
    Advantages of Hadoop according to Gennaro (Rino) Persico (Database Software Leader at IBM)
    Disadvantages of Hadoop
Case Studies in different domains
    1) Social Media
    2) Retail
    3) Internet of things
    4) Natural Calamity
Sentiment Analysis on tweets from Twitter for a particular hashtag
    Results for #Maggi
    Results for #SalmanVerdict
Future Scope
References

Abstract: The size of the databases used today is growing at an exponential rate. The explosion of information systems, the Internet, web-based applications, social networks and new technologies has given rise to a humongous amount of information, also known as "Big Data". There are billions of active users on social media, with terabytes and petabytes of information loaded each day. Storing this gigantic volume of information and extracting useful, decision-relevant knowledge from it is difficult using traditional data-processing methods. Hadoop is one of the tools that can manage such data using HDFS, MapReduce and other components. This research gives an insight into Big Data and Hadoop, and shows how sentiment analysis using text mining is done in RapidMiner.

Keywords: Big Data, social media analytics, sentiment analysis, Hadoop, RapidMiner.

Introduction: In this techno age, with ever more devices and new upcoming technologies, databases are overwhelmed by the data entered each day. Data is generated through many sources, viz. transactions, business processes, social networking sites, web-based applications, web servers, etc. Businesses mostly use this huge volume of data when introducing new schemes or products. Today's business and politics face fierce competition: industry uses big data technologies for decision-making and generating business intelligence, while politicians use them for analyzing citizens' behavior. Processing this enormous amount of data and extracting meaningful information from it is a tedious task.

Literature survey:

Introduction to Big Data:

Generally people have the misconception that Big Data is a new technology. Actually, Big Data is not a technology but a term for a collection of large and complex datasets. It is not limited to the data lying in servers: the term refers to every piece of data an organization has stored until now. Data is mainly of three types: structured, unstructured and semi-structured [Dai Clegg, "Semi-Structured data analytics: Relational or Hadoop platform? Part 1", IBM blog, http://www.ibmbigdatahub.com/blog/semi-structured-data-analytics-relational-or-hadoop-platform-part-1].

Figure 1 [Wayne Eckerson, "Big Data Analytics: Profiling use of analytical platforms in user organizations", http://www.slideshare.net/calidadgmv/big-data-analytics-research-report-29783323]

Structured data generally refers to data that has a defined length and format and is usually stored in a database.

Unstructured data is not properly defined or structured. Over 80% of data is unstructured [IBM Business Analytics, "IBM Smarter Analytics", YouTube video, https://www.youtube.com/watch?v=3rFtrCzfWVQ].

Semi-structured data is also often called instrumentation data when it comes from sensors or other instrumented sources.

In any sector, such data is considered extremely valuable. The amount of data in the world is exploding: data is growing at a 40 percent compound annual rate, reaching nearly 45 ZB by 2020 [Jeff Heaton, "Big Data Presentation at Maryville University, March 25, 2015", YouTube video, https://www.youtube.com/watch?v=l2lYXX2hwaI] (1 ZB = 1 billion TB). Today data is so voluminous that it is difficult to store in disks or centralized systems, so data processing is shifting to distributed architectures.

Three V's of big data according to IBM: Volume, Velocity, and Variety.

• Volume: the huge amount of data.

• Velocity: the rate at which data is being generated on the internet.

• Variety: the collection of different kinds of data.

Common sources of big data:

R. A. Fadenavis, Samrudhi Tabhane, (date), "Big Data Processing Using Hadoop": data generated from the Internet of Things, sensors, mobile devices, search engines, social networking websites, scientific data and enterprises all contributes to this huge explosion in data.

Gali Halevi, Dr. Henk Moed, (30-Sept-2012), "The Evolution of Big Data as a Research and Scientific Topic": Big Data can be seen in finance and business, where an enormous amount of stock-exchange, banking, and online and onsite purchasing data flows through computerized systems every day and is captured and stored for inventory monitoring and for analyzing customer and market behavior. It can also be seen in the life sciences, where big sets of data such as genome sequences, clinical data and patient data are analyzed and used to advance breakthroughs in science and research.

Big data analytics:

Big data analytics [http://www.edupritstine.com/courses/big-data-hadoop-program/big-data-hadoop-course] refers to the process of collecting, organizing and analyzing large sets of data ("big data") to discover patterns and other useful information.

Big data requires tools and methods that can analyze the available large data and extract patterns from it. Hadoop is one such big data tool.

Challenges in data processing:

(19-May-2015), "Challenges and Opportunities with Big Data": capturing data from various sources, filtering that data, storing it (a major challenge), searching, scaling, sharing, transferring, analyzing, presenting it, timeliness, privacy, human collaboration and processing of data.

The major disadvantage of the data warehouse approach was that algorithms ran on a small sample of data collected from different sources. The process of collecting the data, cleaning it and then running analytics on it has a huge turnaround time, which makes the results go stale. It was this lack of capability to process lots of data, and to do so quickly, that started to cause problems.

Other challenges included the growth in volume and the limited scalability of data in RDBMS systems. To overcome the challenges of RDBMS, distributed systems were introduced. But a problem arises in distributed systems when a single machine crashes, which then needs a manual check. Thus, to overcome the above challenges, Hadoop was introduced.

Hadoop:

Hadoop is an amalgamation of open-source programs and procedures which anyone can use as a core ingredient of their big data operations [Bernard Marr, "What's Hadoop? Here's a Simple Explanation for Everyone", http://smartdatacollective.com/bernardmarr/205456/what-s-hadoop-here-s-simple-explanation-everyone].

Hadoop vs. Conventional relational database:

- In Hadoop, data is distributed across many nodes and processing of that data is also distributed, whereas in a conventional relational database, conceptually all the data sits on one server in one database.

- In Hadoop, once you have written data you are not allowed to modify it; you can delete it but not modify it. In a relational database, data can be rewritten any number of times, like the balance of your account.

- Hadoop is optimized for write-once data: if it is archival data about telephone calls or transactions, you do not want to change it once it is written, whereas in a relational database data is constantly updated through SQL.

- Hadoop does not support conventional SQL directly; data is typically accessed through NoSQL-style interfaces or SQL-like layers built on top of it rather than through a relational engine.

Alternatives of Hadoop:

Following are the different alternatives of Hadoop:

• BashReduce

• Disco Project

• Spark

• GraphLab

Why use Hadoop?

In spite of the above alternatives, Hadoop is much more popular for the following reasons:

• Data is distributed to multiple systems, so parallel processing can be carried out.

• Hadoop is highly available: data is always available and the process never stops.

• Fault tolerance: even if there is a fault, the process still never stops.

• High throughput: due to parallelism, Hadoop achieves high throughput.

Popular companies working on Hadoop project development:

• Apache

• Cloudera

• Hortonworks

• Intel

• IBM, etc.

Origin of Hadoop:

Hadoop is an open-source project developed by Doug Cutting and his team, inspired by Google's white papers.

Finding something relevant on the web ten years ago was very difficult; today search engines provide the best results in a few milliseconds. Back then, the industry realized that it needed clusters of computers to process big amounts of data, as the data to be processed was too much for the capacity of a single powerful server, and the data was constantly increasing and changing. At that point Yahoo started to face big data problems, as it was looking at data from the complete World Wide Web, and it developed its own versions of distributed computing.

Google then published papers on the Google File System and MapReduce. At the time, these were just high-level designs.

Later, Doug Cutting and Mike Cafarella, who were working at Yahoo, made improvements, and at the end of 2005 the project was named Hadoop.

In August 2010, Doug Cutting joined Cloudera. In 2013, Apache released a stable version of Hadoop 2, also known as YARN. All versions of Hadoop are open source and managed by Apache.

Hadoop was derived from Google's MapReduce and the Google File System. Yahoo was the originator, has been a major contributor, and uses Hadoop across its business.

Hadoop Architecture:

Figure 2 [Antonio Cangiano, "Why Big SQL", http://sqlforhadoop.com/2013/05/why-bigsql/]

Hadoop comprises multiple concepts and modules like HDFS, MapReduce, HBase, Pig, Hive, Sqoop and ZooKeeper to perform easy and fast processing of huge data [http://www.edupritstine.com/courses/big-data-hadoop-program/big-data-hadoop-course].

ZooKeeper: A high-performance coordination service for distributed applications.

Pig: A high-level data-flow language and execution framework for parallel computation.

Hive: A data warehouse infrastructure that provides data summarization and ad hoc querying.

HBase: A scalable, distributed database that supports structured data storage for large tables.

[Dominique A. Heger, "Hadoop Design, Architecture & MapReduce Performance", http://www.datanubes.com/mediac/HadoopArchPerfDHT.pdf]

Hadoop is software that provides classes and interfaces. It is conceptually different from relational databases and can process high-volume, high-velocity and high-variety data to generate value.

Hadoop Daemons:

There are five Daemons running in background:

• NameNode

• DataNode

• JobTracker

• TaskTracker

• Secondary NameNode

Hadoop has two main Components:

HDFS-Hadoop Distributed File System:

It is like the container of the Hadoop system; it is designed to store large amounts of data. It creates multiple data blocks and stores each block redundantly across a pool of servers to enable reliable, extremely rapid computation.

• HDFS is highly fault tolerant and is designed to be deployed on low-cost hardware.

• HDFS is suitable for applications that have large datasets.

• HDFS maintains the metadata on a dedicated server called the NameNode, while the application data is kept on separate nodes called DataNodes.

In short, HDFS is used for storing data and MapReduce for processing it.

Figure 3 [Antonio Cangiano, "Why Big SQL", http://sqlforhadoop.com/2013/05/why-bigsql/]

NameNode: The HDFS namespace is a hierarchy of files and directories. The NameNode knows everything about a given file; if a client asks to read a file, the request is sent first to the NameNode. It maintains the metadata.

This metadata contains two files:

• FSImage

• Editlogs.

The first time data is stored, an initial image of the file system is saved in FSImage; after that, every block update is recorded in the Editlogs.
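The FSImage/Editlogs idea can be illustrated with a tiny sketch, assuming a made-up namespace and log format (this is not the real NameNode code, which uses binary on-disk structures):

```python
# Toy sketch of the FSImage/Editlogs idea:
# FSImage is a checkpoint of the namespace; Editlogs records every
# change made after the checkpoint, and replaying it restores state.
fsimage = {"/a.txt": ["blk_1"]}                 # checkpointed namespace
editlog = [("add_block", "/a.txt", "blk_2"),    # changes since checkpoint
           ("create", "/b.txt", "blk_3")]

def replay(image, log):
    """Rebuild the current namespace by applying the edit log to the image."""
    state = {path: list(blocks) for path, blocks in image.items()}
    for op, path, block in log:
        if op == "create":
            state[path] = [block]
        elif op == "add_block":
            state[path].append(block)
    return state

print(replay(fsimage, editlog))
# {'/a.txt': ['blk_1', 'blk_2'], '/b.txt': ['blk_3']}
```

Replaying the edit log against the last checkpoint is how the NameNode recovers its namespace after a restart.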

Files and directories are represented on the NameNode using inodes, which record attributes like filename, block details, replication factor, permissions, modification and access times, namespace and disk-space quotas.

The file content is split into blocks (typically 128 MB); each block of the file is independently replicated at multiple DataNodes. The NameNode maintains the mapping of file blocks to DataNodes.
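A minimal sketch of this block-splitting and replication bookkeeping, assuming the typical 128 MB block size and the default replication factor of 3 (the node names and round-robin placement are hypothetical; real HDFS placement is rack-aware):

```python
# Toy sketch (not the real HDFS code): split a file into 128 MB blocks
# and place each block's replicas on distinct DataNodes, round-robin.
BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB, the typical HDFS block size
REPLICATION = 3                 # default replication factor

def split_into_blocks(file_size):
    """Return the sizes of the blocks a file of `file_size` bytes occupies."""
    blocks = []
    remaining = file_size
    while remaining > 0:
        blocks.append(min(BLOCK_SIZE, remaining))
        remaining -= BLOCK_SIZE
    return blocks

def place_replicas(num_blocks, datanodes):
    """Map each block index to REPLICATION distinct DataNodes (round-robin)."""
    placement = {}
    for b in range(num_blocks):
        placement[b] = [datanodes[(b + r) % len(datanodes)]
                        for r in range(REPLICATION)]
    return placement

blocks = split_into_blocks(300 * 1024 * 1024)          # a 300 MB file
print(len(blocks))                                     # 3 blocks: 128 + 128 + 44 MB
print(place_replicas(len(blocks), ["dn1", "dn2", "dn3", "dn4"]))
```

With three replicas of every block, the loss of any single DataNode never loses data, which is the fault-tolerance property claimed above.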

DataNodes: Two files represent each data block in the host's native file system: one contains the data itself and the other contains the block's metadata.

Client machines: neither NameNodes nor DataNodes, client machines have Hadoop installed on them. They are responsible for loading data into the cluster, submitting MapReduce jobs, and viewing the results of a job once complete.

MapReduce:

MapReduce is a tool/program written in Java for managing and processing vast amounts of unstructured data in parallel, based on dividing a big data item into smaller independent task units.

Hadoop is not really a database: it only stores data and lets you fetch it back out; there are no queries involved as in SQL. Multiple mappers run on different machines in parallel. The main aim of MapReduce is to group tasks of a similar nature so that the same type of data is placed on the same nodes.

Two functions of MapReduce:

Mapper tasks process the input records.

Reducer tasks use the output of the mappers (coming from multiple servers) and merge the results together.

Once the tasks have been created, they are spread across multiple nodes and run simultaneously (the "map" step). The "reduce" phase then combines the results.

The JobTracker and TaskTracker handle delegation of tasks.

• JobTracker: The JobTracker oversees how MapReduce jobs are split up into tasks and divided

among nodes within the cluster.

• TaskTracker: The TaskTracker accepts tasks from the JobTracker, performs the work and alerts

the JobTracker once it’s done.
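The map and reduce steps above can be sketched as a single-machine word count in plain Python (real Hadoop MapReduce is written against the Java API and runs across many nodes):

```python
# Minimal word-count sketch of the map, shuffle and reduce phases
# (pure Python, single machine; Hadoop distributes these across nodes).
from collections import defaultdict

def mapper(line):
    """Map step: emit a (word, 1) pair for every word in the input line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Shuffle step: group all values by key, as the framework does."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reducer(key, values):
    """Reduce step: merge the mapper outputs for one key into a total."""
    return key, sum(values)

lines = ["big data needs big tools", "hadoop handles big data"]
pairs = [pair for line in lines for pair in mapper(line)]        # map
counts = dict(reducer(k, v) for k, v in shuffle(pairs).items())  # shuffle + reduce
print(counts["big"])   # 3
print(counts["data"])  # 2
```

Because each mapper sees only its own lines and each reducer only one key's values, both phases parallelize naturally, which is what gives Hadoop its high throughput.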

Figure 4 [Praveen Kumar, "Efficient Capabilities of Processing of Big Data using Hadoop Map Reduce", http://www.ijarcce.com/upload/2014/june/IJARCCE7F%20s%20praveen%20vashisht%20Efficient%20Capabilities.pdf]

From HDFS, data is sent to different mappers residing on different machines; the intermediate results are shuffled and then sent to the reduce phase.

Advantages of Hadoop according to Gennaro (Rino) Persico (Database Software Leader at IBM):

• Hadoop does not require any specialized hardware to process data.

• Hadoop processes move closer to the data.

• Data and Server failures are managed: Hadoop HDFS replicates data blocks among the nodes

and a single failure does not impact the processing.

• Query and manage with Hive: Apache Hive is built on top of Hadoop and is used to query and manage data in distributed storage.

• When not to use Hadoop: if your data is not documents or content, if your data is composed of records, or if your workload is transactional, you still require database services.

[Bernard Marr, "What's Hadoop? Here's a Simple Explanation for Everyone", http://smartdatacollective.com/bernardmarr/205456/what-s-hadoop-here-s-simple-explanation-everyone]

Disadvantages of Hadoop:

1. Cluster management is hard [http://www.j2eebrain.com/java-J2ee-hadoop-advantages-and-disadvantages.html]: in the cluster, operations like debugging, distributing software and collecting logs are too hard.

2. Security concerns: the Hadoop security model is disabled by default, so if whoever uses the platform does not know to enable it, the data might be at huge risk.

3. Vulnerable by nature [http://www.bigdatacompanies.com/5-big-disadvantages-of-hadoop-for-big-data/]: the Hadoop framework is written in Java, a very popular language that has been heavily exploited by cybercriminals, so there are chances of data being hacked.

Case Studies in different domains:

1) Social Media:

Facebook used a commercial RDBMS, but the data was growing too fast and the infrastructure was not adequate, so Facebook switched to Hadoop. Facebook uses Hadoop in a variety of ways, like log processing, recommendation systems and data warehousing. Data is stored in the Hadoop/Hive warehouse and the Hadoop archival store. It is heavily used for sentiment analysis, business intelligence, emotion tracing and many other applications [Ashish Thusoo, "Hive - A Petabyte Scale Data Warehouse using Hadoop", https://www.facebook.com/notes/facebook-engineering/hive-a-petabyte-scale-data-warehouse-using-hadoop/89508453919, 10-Jun-2009].

2) Retail:

Walmart started using the term Big Data before it was known in the industry. It uses a 250-node Hadoop cluster and built Polaris, a semantic search engine whose objective is to deliver meaningful results for a shopper's keywords, using a constellation of methods that take the shopper's interest into account and intuit his or her intent [Rick Moss, "Walmart.com's improved Search engine powered by 'Social Genome'", http://www.retailwire.com/discussion/16260/walmart-coms-improved-search-engine-powered-by-social-genome].

3) Internet of things:

According to a Cisco estimate, by 2020 there will be 11 billion devices connected with each other. By using an enterprise-grade Hadoop architecture, you can create enormous opportunities for your business: the partnership between Cisco and MapR provides an infrastructure that can readily handle the massive amounts of real-time big data from the Internet of Things [Rajeev Tiwari, "Functions that can benefit from Hadoop", http://www.dqindia.com/functions-that-can-benefit-from-hadoop/].

4) Natural Calamity:

An innovative company called Terra Seismic believes that earthquakes can be predicted 20-30 days before they occur. Using big data and satellite information, it collects data on each region where the probability of an earthquake is higher. This data is then combined with a huge number of earthquake precursors, such as ground-water level changes, sudden clouds, changes in ground conductivity, and geomagnetic and gravity anomalies. A Hadoop cluster supports the growing dataset, and job execution time decreases as the number of nodes increases [Srikanth R P, "Could the Nepal Earthquake have been predicted by using Big Data?", http://www.dqindia.com/could-the-nepal-earthquake-have-been-predicted-by-using-big-data/, 4-May-2015].

Sentiment Analysis on tweets from Twitter for a particular hashtag: For this analysis, we used RapidMiner.

RapidMiner is a software platform for data mining, text mining, predictive analytics and business analytics.

The following steps were followed for the analysis:

1. A dataset is loaded into RapidMiner to perform sentiment analysis.

2. Using NodeXL, a free and open-source Microsoft template for network analysis and visualization, we retrieved a dataset of tweets for #Maggi from Twitter.

3. After loading the #Maggi dataset in RapidMiner, we performed the steps from [http://blog.aylien.com/post/98466399268/analyzing-text-in-rapidminer-part-1]. Below is the result obtained.
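As a rough stand-in for what the RapidMiner workflow computes, a lexicon-based polarity check can be sketched in Python (the word lists and sample tweets here are entirely hypothetical):

```python
# Illustrative substitute for the RapidMiner text-mining workflow:
# score each tweet against a tiny hand-made sentiment lexicon.
POSITIVE = {"love", "great", "happy", "good", "best"}   # hypothetical lexicon
NEGATIVE = {"ban", "unsafe", "bad", "worst", "sad"}     # hypothetical lexicon

def polarity(tweet):
    """Label a tweet positive/negative/neutral by lexicon word counts."""
    words = set(tweet.lower().split())
    score = len(words & POSITIVE) - len(words & NEGATIVE)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

tweets = [
    "I love maggi it is the best",
    "maggi ban was needed it is unsafe",
    "waiting for the maggi verdict",
]
print([polarity(t) for t in tweets])  # ['positive', 'negative', 'neutral']
```

Real text-mining pipelines add tokenization, stop-word removal and stemming before scoring, but the per-tweet polarity label is the same kind of output the results below summarize.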

Results for #Maggi:

Hence, from the polarity we can see what people's views are on the ban of Maggi.

Results for #SalmanVerdict:

Hence, from the polarity we can see people's views regarding the court's verdict in the Salman Khan hit-and-run case.

Tweet analysis for #Maggi:

negative 40%
positive 20%
neutral 40%

Tweet analysis for #SalmanVerdict:

positive 47%
negative 20%
neutral 33%
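The percentage breakdowns above can be derived from per-tweet polarity labels as follows (the label list here is made up to mirror the #Maggi split):

```python
# Sketch of turning a list of per-tweet polarity labels into the
# percentage breakdowns reported above (the labels are illustrative).
from collections import Counter

def polarity_percentages(labels):
    """Return each polarity's share of the tweets, as whole percentages."""
    counts = Counter(labels)
    total = len(labels)
    return {label: round(100 * n / total) for label, n in counts.items()}

labels = ["negative"] * 4 + ["positive"] * 2 + ["neutral"] * 4  # 10 tweets
print(polarity_percentages(labels))  # {'negative': 40, 'positive': 20, 'neutral': 40}
```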

Future Scope:

• Practical implementation of Big Data Analytics using Hadoop.

• Sentiment analysis of voice recordings.

References:

[1] https://www.youtube.com/watch?v=l2lYXX2hwaI
[2] R. A. Fadenavis, Samrudhi Tabhane, (date), "Big Data Processing Using Hadoop"
[3] http://www.ibmbigdatahub.com/blog/semi-structured-data-analytics-relational-or-hadoop-platform-part-1
[4] https://www.youtube.com/watch?v=3rFtrCzfWVQ
[5] http://www.edupritstine.com/courses/big-data-hadoop-program/big-data-hadoop-course
[6] Unknown, (19-May-2015), "Challenges and Opportunities with Big Data"
[7] Gali Halevi, Dr. Henk Moed, (30-Sept-2012), "The Evolution of Big Data as a Research and Scientific Topic"
[8] http://www.slideshare.net/calidadgmv/big-data-analytics-research-report-29783323
[9] http://smartdatacollective.com/bernardmarr/205456/what-s-hadoop-here-s-simple-explanation-everyone
[10] http://blogs.forrester.com/mike_gualtieri/13-06-07-what_is_hadoop
[11] http://www.dqindia.com/functions-that-can-benefit-from-hadoop/
[12] http://www.j2eebrain.com/java-J2ee-hadoop-advantages-and-disadvantages.html
[13] http://www.bigdatacompanies.com/5-big-disadvantages-of-hadoop-for-big-data/
[14] https://www.facebook.com/notes/facebook-engineering/hive-a-petabyte-scale-data-warehouse-using-hadoop/89508453919
[15] http://www.dqindia.com/could-the-nepal-earthquake-have-been-predicted-by-using-big-data/
[16] http://www.retailwire.com/discussion/16260/walmart-coms-improved-search-engine-powered-by-social-genome