BIG DATA: A 360 ° Overview Juvénal CHOKOGOUE M Consultant Business Analytics – Big Data BD-DE-0005 11/23/2014

Big Data : a 360° Overview

Jul 08, 2015

When writing this paper, my main objective was to provide a clear understanding of where the term "Big Data" comes from, why it is so popular now, what it really means and what its implications are for businesses. Because the full power of Big Data can be revealed only by Analytics, I also describe widely recognized and widely used analytical techniques, to help you see how analytics, used in conjunction with Big Data, can boost business performance.
I hope that by the end of this paper:
- you will smile the next time you read or hear the terms big data, Hadoop, or analytics :)
- you will understand the technologies that are behind the scenes when one talks about "Big Data"
- you will know how to "make sense" of Big Data using Analytics
- you will have a basic idea of the data mining techniques used in business in general and in Big Data in particular
- you will be able to follow the news about Big Data
Page 1: Big Data : a 360° Overview

BIG DATA: A 360° Overview

Juvénal CHOKOGOUE M

Consultant Business Analytics – Big Data

BD-DE-0005

11/23/2014

Page 2: Big Data : a 360° Overview

Module Overview

• The Business Challenge

• What does this module stand for?

• Who is this module for?

• Before the battle begins

• Anyway! What is Big Data?

• Big Data and Analytics: how are these two married together?

• Analytical Techniques for Mining Big Data

• The New Infrastructure for Data Management: Hadoop

• Big Data adoption: now or later?

• The Next Steps

• What should I remember?

• Some Big Data Providers

• Bibliography & Resources

• About me

Page 3: Big Data : a 360° Overview

The Business Challenge

• Scaling operations up and down as conditions change, and decreasing the “time to market” for decision-making, have become critical competitive differentiators in today’s economy.

• Companies are gathering more and more data to stay competitive.

• If they want to decrease their “time to market”, they must make sense of the intersection of all the different kinds of data they have gathered.

• Technically, when you are dealing with so much data in so many different forms, it is impossible to think about data management in traditional ways.

• The challenges and opportunities associated with this new kind of data management problem are known today as "Big Data".

Page 4: Big Data : a 360° Overview

What does this module stand for?

As with any new technological concept, software companies fight over definitions in order to sell their products, leaving businesses with a confused idea of the concept and of where it fits among the issues they face. Big Data, like Cloud Computing, Virtualization, Data Mining and so on, is just one of these concepts.

I hope that by the end of this paper:

• you will smile the next time you read or hear the terms big data, Hadoop, or analytics :)

• you will understand what is behind the scenes when one talks about "Big Data"

• you will know how one can "make sense" of Big Data using Analytics

• you will have a basic idea of the data mining techniques used in business and in Big Data

• you will be able to follow the news about Big Data

So, keep listening…

Page 5: Big Data : a 360° Overview

What does this module stand for?

As with any new technological concept, software companies fight over definitions in order to sell their products, leaving businesses with a confused idea of the concept and of where it fits among the issues they face. Big Data, like Cloud Computing, Virtualization, Data Mining and so on, is just one of these concepts.

When writing this paper, my main objective was to provide a true 360° overview of Big Data, that is, a clear understanding of where the term "Big Data" comes from, why it is so popular now, what it really means and what its implications are for businesses. Because Analytics is another term that is associated with Big Data, I also describe widely recognized and widely used analytical techniques, to help you see how analytics, used in conjunction with Big Data, can boost business performance.

So please don't put words in my mouth: this paper is not intended as a “how-to” for big data project management, for big data application development, or for statistical model building. Those will be the subject of other papers. Rather, I hope that by the end of this paper:

• you will smile the next time you read or hear the terms big data, Hadoop, or analytics :)

• you will understand what is behind the scenes when one talks about "Big Data"

• you will know how one can "make sense" of Big Data using Analytics

• you will have a basic idea of the data mining techniques used in business and in Big Data

• you will be able to follow every update about Big Data

So, keep reading…

Page 6: Big Data : a 360° Overview

Before the battle begins

The information provided here is for informational purposes only and represents my point of view as of the date of this presentation. Because market conditions change, the information provided here may be modified or become obsolete; it should not be interpreted as a commitment, and I cannot guarantee its accuracy after the date of this presentation.

The contents of the websites referenced here may be modified, or the websites themselves may become unavailable, after the publication of this presentation. I therefore MAKE NO warranties, express, implied or statutory, as to the information in this presentation.

In this presentation, I use the term "Analyst" for the person who is responsible for data management, analytics, and programming jobs. It is just a simplification that I adopted so that you are not distracted by the new jobs and terms created by Big Data and can focus on the content of the paper.

Microsoft, SQL Server, Teradata, Oracle, Google, Hadoop, Cloudera, Hortonworks, SAS, EMC and other names and products cited here are or may be registered trademarks in the U.S. and/or in other countries.

Feel free to share this module with anyone you know, from your colleagues to your friends, but in that case, don’t forget to mention the name of the author.

You may use and change the content of this module on your own, but in that case I will not be responsible for its content.

This module is not for sale. If you intend to use it on your own, please don’t commercialize it!

Page 7: Big Data : a 360° Overview

Anyway! What is Big Data ?

Page 8: Big Data : a 360° Overview

• According to Gartner: "Big data is high-volume, high-velocity and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making." (http://www.gartner.com/it-glossary/big-data/)

Of all the definitions proposed for Big Data, Gartner's is the most widely adopted. And from that definition, one thing is clear: the term Big Data designates data that is large in volume, arrives at high velocity and comes in a wide variety. These are often referred to as the “3 Vs”, or the three dimensions, of Big Data.

Page 9: Big Data : a 360° Overview

Big Data and Analytics:

How are these two married together?

Page 10: Big Data : a 360° Overview

Taken alone, Big Data is technology-driven. If businesses want to capitalize on their Big Data, they have to find a way to bring along the traditional business analysis techniques they used in the past to query and dive into the data.

But an extremely wide variety of data brings new challenges. Most traditional business analysis techniques are not suitable for the new kinds of data sources we have today, and that is where Analytics comes into play!

Analytics designates the means by which businesses gain insight from data, whatever its source, its size or even its format.

Page 11: Big Data : a 360° Overview

All this said, you can now understand that Big Data Analytics is the concept that designates the new means by which we extract insights from data that are extremely large, extremely varied and extremely fast-moving.

• However, be aware that the efficiency of Analytics depends fundamentally on the question you want to answer and on the quality of the data. Data quality issues must be considered before any analytics concern. As the saying goes in the field: "Garbage in, garbage out".

• Analytics techniques must be handled with caution and require formal training in the field. You may want to consider investing in hiring an analytics professional.

Page 12: Big Data : a 360° Overview

Thirdly, analytics is not a "silver bullet" that will always give you insights.

Fourthly, just because you have insights does not guarantee that you have the power to act on them. Analytics can provide insights, but turning insights from numbers into competitive advantage may require changes that your business can’t afford, or simply doesn’t want to make. The Harvard Business Review explores a case study in which a retailer learned through big data “that he could increase profits substantially by extending the time that items were on the floor before and after discounting. Implementing that change, however, would have required a complete redesign of the supply chain, which the retailer was reluctant to undertake.” (source: https://hbr.org/2013/12/you-may-not-need-big-data-after-all/ar/1)

Analytics does not replace your business intuition. It just makes you feel more confident about your choice. In the end, you may still rely on your experience and your intuition as a manager to make the decision.

Page 13: Big Data : a 360° Overview

Analytical Techniques for Mining Big Data

Page 14: Big Data : a 360° Overview

In this part, I am going to talk only about some techniques I am certified in. These techniques are used in most business scenarios and proved their worth long ago.

These techniques are: Regression (Linear and Logistic), Decision Trees, K-Means, Time Series, Neural Networks, Association Rules, Naive Bayes and Survival Analysis. In addition, I am going to present Text Analytics fundamentals, since in the Big Data age we are generating more and more text data (tweets, Facebook comments, ...).

- Regression

Regression focuses on the relationship between an outcome and its input variables. Here, we are predicting how changes in individual drivers affect the outcome. The outcome can be continuous or discrete. When it is discrete, we are predicting the probability that the outcome will occur. When it is continuous, we are predicting the value of the dependent variable given the independent variables.

(Figure: a survey from TDWI)
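As a rough illustration of the two regression cases, the Python sketch below fits a straight line by ordinary least squares for a continuous outcome, then shows how a logistic function turns a linear score into a probability for a discrete outcome. The function names `fit_line` and `logistic`, the advertising and sales figures, and the logistic coefficients are all hypothetical; none of them come from this presentation.

```python
import math


def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def logistic(z):
    """Map a linear score to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


# Continuous outcome: hypothetical advertising spend (input) and sales (outcome).
ad_spend = [1.0, 2.0, 3.0, 4.0, 5.0]
sales = [3.0, 5.0, 7.0, 9.0, 11.0]
slope, intercept = fit_line(ad_spend, sales)
predicted_sales = slope * 6.0 + intercept  # prediction for a spend of 6.0

# Discrete outcome: hypothetical coefficients, not fitted to any data.
purchase_probability = logistic(-1.0 + 0.5 * 2.0)
```

In practice the logistic coefficients would be estimated from data by a statistical package; they are hard-coded here only to show the shape of the calculation.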

Page 15: Big Data : a 360° Overview

- Decision Trees

Decision Trees are a flexible method very commonly deployed in classification and regression problems. Decision trees partition large amounts of data into smaller segments by applying a series of rules of the form "IF condition THEN expression" (e.g. IF age is less than 30 AND revenue is greater than 36000 THEN class = 'Rich'). Decision trees are visually represented as upside-down trees, with the root at the top and branches emanating from the root. There are two types of trees: Classification Trees and Regression Trees.
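The rule quoted above can be read as one path from the root of a tree to a leaf. The following Python sketch writes that single path as nested tests; the function name `classify` and the fallback label "Other" are assumptions added for illustration, and a real tree would learn its splits from data rather than have them hand-written.

```python
def classify(age, revenue):
    """Apply the example rule: age < 30 and revenue > 36000 gives 'Rich'."""
    # Root node: first split on age.
    if age < 30:
        # Child node: second split on revenue.
        if revenue > 36000:
            return "Rich"
        return "Other"  # hypothetical label for every other segment
    return "Other"


young_high_revenue = classify(25, 40000)
young_low_revenue = classify(25, 30000)
older_high_revenue = classify(45, 50000)
```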

- K-Means

K-means is a clustering method. It falls into the category of Exploratory Data Analysis methods called "Unsupervised Classification". The goal is to group data based on similarities in the input variables, with no target or specific outcome. It is a preferred method for segmentation and profiling.
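To make the grouping idea concrete, here is a minimal Python sketch of the usual k-means loop: assign each point to its nearest centroid, then move each centroid to the mean of its points. The function name `kmeans`, the six two-dimensional points and the fixed starting centroids are hypothetical choices made so that the example is deterministic; production libraries add smarter initialization and a convergence test.

```python
def kmeans(points, centroids, iterations=10):
    """Return (centroids, clusters) after a fixed number of k-means passes."""
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            distances = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            clusters[distances.index(min(distances))].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        centroids = [
            tuple(sum(dim) / len(cluster) for dim in zip(*cluster)) if cluster else c
            for cluster, c in zip(clusters, centroids)
        ]
    return centroids, clusters


# Hypothetical customers described by two scaled input variables.
customers = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centroids, clusters = kmeans(customers, [(0.0, 0.0), (10.0, 10.0)])
```

Note that no target variable appears anywhere: the two segments emerge only from the similarity of the inputs.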

(Figure: a survey from TDWI)

Page 16: Big Data : a 360° Overview

- Time Series

Time Series Analysis provides a scientific methodology for forecasting. It is the analysis of a phenomenon that evolves over time. The main objectives of Time Series Analysis are:

• To understand the underlying structure of the time series by breaking it into trend, seasonality, and noise.

• To fit a mathematical model in order to forecast the future.
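The first objective can be sketched with a simple additive decomposition: estimate the trend with a centered moving average, average the detrended values by position in the cycle to get the seasonal pattern, and treat what is left as noise. This is only one of several possible decomposition methods. The function name `decompose`, the odd season length of 3 and the nine-point series are assumptions made for illustration.

```python
def decompose(series, period):
    """Additive decomposition; this sketch assumes an odd period."""
    half = period // 2
    # Trend: centered moving average over one full period.
    trend = {}
    for t in range(half, len(series) - half):
        window = series[t - half:t + half + 1]
        trend[t] = sum(window) / len(window)
    # Seasonality: average detrended value at each position in the cycle.
    detrended = {t: series[t] - trend[t] for t in trend}
    seasonal = []
    for pos in range(period):
        values = [v for t, v in detrended.items() if t % period == pos]
        seasonal.append(sum(values) / len(values))
    # Noise: whatever trend and seasonality do not explain.
    noise = {t: detrended[t] - seasonal[t % period] for t in trend}
    return trend, seasonal, noise


# Hypothetical series: a rising trend plus a repeating pattern of length 3.
observations = [12.0, 10.0, 11.0, 15.0, 13.0, 14.0, 18.0, 16.0, 17.0]
trend, seasonal, noise = decompose(observations, period=3)
```

A forecasting model would then extend the trend and re-apply the seasonal pattern to future periods.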

- Neural Network

Artificial Neural Networks are a class of flexible non-linear models used for prediction problems. The power of neural networks comes from the fact that they can approximate virtually any continuous association between the inputs and the target, whatever the kind of relationship between them. There are many kinds of neural networks, but the most widely used is the Multi-Layer Perceptron (MLP).

- Association Rules

Also known as association rule discovery, Market Basket Analysis or affinity analysis, association rule mining is a popular data mining method for exploring associations between items (data). It is an unsupervised method for in-database mining over transactions in databases.
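Association rules are commonly scored with two standard measures, support (how often the items appear together) and confidence (how often the rule holds when its left-hand side is present). The presentation does not name these measures, so treat the Python sketch below as an assumed illustration; the basket contents and the function names `support` and `confidence` are hypothetical.

```python
# Hypothetical shopping baskets, one set of items per transaction.
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
]


def support(itemset):
    """Fraction of transactions that contain every item of the itemset."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)


def confidence(antecedent, consequent):
    """Estimated probability of the consequent given the antecedent."""
    return support(antecedent | consequent) / support(antecedent)


# Score the rule "diapers => beer".
rule_support = support({"diapers", "beer"})
rule_confidence = confidence({"diapers"}, {"beer"})
```

No target column is involved, which is why the method counts as unsupervised.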

Page 17: Big Data : a 360° Overview

- Naive Bayes

Naive Bayes is a "classifier": it is used to classify, or assign labels to, objects by applying Bayes' theorem with strong (naïve) independence assumptions. Naive Bayes is especially suited to problems where you have categorical inputs with a lot of levels.
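A minimal sketch of the idea in Python, assuming purely categorical inputs: count how often each input value occurs within each class, then score a new row as the class prior multiplied by the per-feature conditional frequencies. The function names `train` and `predict` and the six training rows are hypothetical, and the sketch omits the smoothing that real implementations use to avoid zero probabilities for unseen values.

```python
from collections import Counter, defaultdict


def train(rows, labels):
    """Count classes and, per (feature index, class), the values observed."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            value_counts[(i, label)][value] += 1
    return class_counts, value_counts


def predict(model, row):
    """Return the class with the highest prior-times-likelihood score."""
    class_counts, value_counts = model
    total = sum(class_counts.values())
    scores = {}
    for label, count in class_counts.items():
        score = count / total  # prior P(label)
        for i, value in enumerate(row):
            # P(value | label), treating the features as independent given the label.
            score *= value_counts[(i, label)][value] / count
        scores[label] = score
    return max(scores, key=scores.get)


# Hypothetical training data: (city, device) and whether the visitor bought.
rows = [("paris", "mobile"), ("paris", "desktop"), ("lyon", "mobile"),
        ("lyon", "desktop"), ("paris", "mobile"), ("lyon", "desktop")]
labels = ["buy", "buy", "no", "no", "buy", "no"]
model = train(rows, labels)
first_prediction = predict(model, ("paris", "mobile"))
second_prediction = predict(model, ("lyon", "desktop"))
```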

- Survival Analysis

Survival analysis is a class of statistical methods for studying the occurrence and timing of events. It is suitable for problems where you want to know WHEN a specific event will happen. The most common approaches to building a survival model are the following: life tables, Kaplan-Meier estimators, exponential regression, proportional hazards regression, competing risks models and discrete-time methods.
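Of the approaches listed, the Kaplan-Meier estimator is the easiest to sketch: at each time where an event occurs, the survival probability is multiplied by the share of subjects still at risk who did not experience the event. The Python below is a minimal version under assumed data; the function name `kaplan_meier`, the durations (read them as months until a customer leaves) and the observed flags are hypothetical.

```python
def kaplan_meier(durations, observed):
    """Return [(time, survival probability)] at each time an event occurred.

    observed[i] is False when subject i was censored, that is, still
    event-free when observation stopped at durations[i].
    """
    survival = 1.0
    curve = []
    event_times = sorted({t for t, e in zip(durations, observed) if e})
    for t in event_times:
        at_risk = sum(1 for d in durations if d >= t)
        events = sum(1 for d, e in zip(durations, observed) if d == t and e)
        survival *= 1 - events / at_risk
        curve.append((t, survival))
    return curve


# Hypothetical follow-up of five customers.
durations = [2, 3, 3, 5, 8]
observed = [True, True, False, True, False]
curve = kaplan_meier(durations, observed)
```

Censored subjects still count in the at-risk group up to the time they leave the study, which is what distinguishes this estimator from a plain percentage of events.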

- Text Analytics fundamentals

Text analytics is the process of analyzing unstructured text, extracting relevant information, and transforming it into structured information that can then be leveraged in various ways. The analysis and extraction processes take advantage of techniques that originated in computational linguistics (natural language processing), statistics, and other computer science disciplines.
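One very small instance of turning unstructured text into structured information is a table of term counts. The Python sketch below lower-cases each comment, splits it into word tokens, drops a few stop words and counts what remains. The comments, the stop-word list and the function name `term_counts` are hypothetical, and real text analytics pipelines go much further (stemming, entity extraction, sentiment scoring and so on).

```python
import re
from collections import Counter

# Hypothetical stop words to ignore; real lists are much longer.
STOP_WORDS = {"the", "was", "but"}


def term_counts(text):
    """Turn one free-text comment into a structured term -> count mapping."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(t for t in tokens if t not in STOP_WORDS)


# Hypothetical customer comments.
comments = [
    "Great service, great prices!",
    "The service was slow.",
    "Slow delivery but great support",
]
per_comment = [term_counts(c) for c in comments]
overall = sum(per_comment, Counter())
```

Each comment becomes a row of counts, so the result can feed the other techniques described in this part.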

Page 18: Big Data : a 360° Overview

The New Infrastructure for Data Management: Hadoop

Page 19: Big Data : a 360° Overview

6.1 The new data management strategy

• The centralized approach to data processing is no longer efficient nowadays!

• To deal with Big Data, the idea is to distribute the storage of data and parallelize the processing of that data across a cluster of several computers: the cluster computing infrastructure.

• In cluster computing:

- Data files are stored redundantly.

- Computations are divided into tasks and parallelized.

• The redundancy of the data across multiple hard disks is supported by a new kind of file system called the "Distributed File System" (DFS), and the parallelism of the processing is achieved through a new kind of programming model called "MapReduce".

• The most popular (and already mature) implementation of MapReduce is called "Hadoop". Hadoop comes with HDFS (the Hadoop Distributed File System).

• Yes, you got it! You can use an implementation of MapReduce to manage many large-scale data computations in a way that is tolerant of hardware faults.

(Figures: a cluster computing environment; MapReduce job description)
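The MapReduce model can be imitated on a single machine to show its three phases: a map function emits key/value pairs, a shuffle groups the values by key, and a reduce function collapses each group. The Python sketch below does this for a "maximum temperature per year" job. It is not Hadoop code; the function names, the tab-separated record layout and the use of 9999 as a missing-reading marker are assumptions made for illustration.

```python
from collections import defaultdict


def map_record(line):
    """Map phase: emit (year, temperature) for each valid input line."""
    year, temperature = line.split("\t")
    temperature = int(temperature)
    if temperature != 9999:  # assumed marker for a missing reading
        yield year, temperature


def shuffle(pairs):
    """Shuffle phase: group all values that share the same key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_group(year, temperatures):
    """Reduce phase: collapse one group into a single result."""
    return year, max(temperatures)


# Hypothetical input; on a cluster these lines would be spread over many machines.
lines = ["1950\t0", "1950\t22", "1950\t-11", "1949\t111", "1949\t78", "1949\t9999"]
pairs = [pair for line in lines for pair in map_record(line)]
results = dict(reduce_group(year, temps) for year, temps in shuffle(pairs).items())
```

Because each call to `map_record` looks at one line only and each call to `reduce_group` looks at one key only, a framework is free to run them in parallel on different machines and to rerun any task that fails.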

Page 20: Big Data : a 360° Overview

6.2 The Hadoop Ecosystem

• Hadoop is a platform that implements MapReduce and provides a redundant, reliable and distributed file system optimized for large files.

• In reality, Hadoop is just a set of Java classes for HDFS types and MapReduce job management (MapReduce jobs can also be written in other programming languages such as Python, C#, C++, ...).

• These classes allow the analyst to write functions that extract insight from data without having to worry about how the code is distributed and parallelized in the cluster environment.

• To get the most out of a Hadoop cluster, a set of technologies and tools has been developed. This set of tools forms what is today commonly called the Hadoop Ecosystem.

• The most foundational tools of the Hadoop Ecosystem are the following: Pig, Hive, HBase, Sqoop, Zookeeper and Mahout.

Page 21: Big Data : a 360° Overview

- Pig

Pig is an interactive data flow (or script-based) language and execution environment for Hadoop. Pig provides a data flow language called Pig Latin that lets you express a series of operations to apply to input data in order to produce output.

- Hive

Hive is an interactive and batch query language based on SQL for building MapReduce jobs. It provides users who know SQL with a simple SQL-like language called HiveQL.

- HBase

HBase is a distributed, column-oriented database that uses HDFS as its persistence store and supports both MapReduce and point queries. It is capable of hosting very large tables (billions of rows by millions of columns) because it is layered on Hadoop clusters of commodity hardware.

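E.g. of a Pig script: finding the maximum temperature by year

-- Load tab-separated records: year, temperature, quality code
records = LOAD 'data/samples.txt' AS (year:chararray, temperature:int, quality:int);
-- Drop readings equal to 9999 (presumably a missing-value marker) and filter on quality
filtered_records = FILTER records BY temperature != 9999 AND (quality == 0 OR quality == 4);
-- Group the remaining records by year
grouped_records = GROUP filtered_records BY year;
-- Compute the maximum temperature within each year
max_temp = FOREACH grouped_records GENERATE group, MAX(filtered_records.temperature);
DUMP max_temp;

The same example written in HiveQL

-- Table of tab-separated records: year, temperature, quality code
CREATE TABLE records (year STRING, temperature INT, quality INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
-- Load the local sample file, replacing any existing rows
LOAD DATA LOCAL INPATH 'data/sample.txt'
OVERWRITE INTO TABLE records;
-- Maximum temperature per year, skipping readings equal to 9999
SELECT year, MAX(temperature) FROM records
WHERE temperature != 9999 AND (quality == 0 OR quality == 1)
GROUP BY year;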

Page 22: Big Data : a 360° Overview

- Sqoop

Sqoop (SQL-to-Hadoop) efficiently transfers data between Hadoop HDFS and structured relational databases, in both directions. Think of Sqoop as the ETL (Extract - Transform - Load) tool of a Hadoop environment.

- Zookeeper

Zookeeper provides a distributed configuration service, a synchronization service and a naming registry for distributed applications. Zookeeper is Hadoop’s way of coordinating all the elements of these distributed applications.

- Mahout

Mahout is a scalable machine learning and data mining library for Hadoop. Think of Mahout as the analytics software of a Hadoop environment. Mahout provides data mining and machine learning algorithms packaged in Java libraries to perform four types of analysis in a Hadoop environment: recommendation mining, classification, clustering and association rules.

Page 23: Big Data : a 360° Overview

BIG DATA ADOPTION :

NOW OR LATER ?

Page 24: Big Data : a 360° Overview

The answer to this question lies in the integration and operationalization of analytics as an integral part of the organization's business processes. This supposes that the organization is data-driven. The Big Data approach is best suited to addressing or solving business problems that meet one or more of the following criteria:

1. Data throttling

2. Computation-restricted throttling

3. Large data volumes

4. Significant data variety

5. Benefits from data parallelization

Page 25: Big Data : a 360° Overview

What should I remember?

• Even though we have always had a lot of data, the difference today is that significantly more of it exists, and it varies in type and timeliness. To cope with this problem, you have to think about managing data differently. That is where "Big Data" comes in.

• Big Data is the name given to the data management challenges and opportunities that emerge when dealing with data that is extremely large in volume, has extremely high velocity and is extremely wide in variety.

• Big Data without Analytics is just data.

• Just because you have insights doesn’t guarantee that you have the power to act on them.

• Not every problem is suitable for Big Data.

• MapReduce is a programming model that makes it possible to manage large-scale data computations in a way that is tolerant of hardware faults.

• Hadoop is a platform that implements MapReduce and provides a redundant, reliable and distributed file system optimized for large files.

Page 26: Big Data : a 360° Overview

Some Big Data Providers

Here are some Big Data providers I personally know. There are others.

- Cloudera, with the first commercial distribution of Hadoop

- Hortonworks, with its commercial distribution of Hadoop

- SAS Institute, with its SAS on Hadoop platform, SAS High Performance Suite, SAS Grid Computing and SAS Visual Analytics

- HP, with its platform called HP Vertica

- EMC, with its platform called Greenplum (Pivotal)

Page 27: Big Data : a 360° Overview

Bibliography & Resources

http://www.cisjournal.org/archive/vol2no4/vol2no4_1.pdf

Hybrid Recommender System Using Naive Bayes Classifier and Collaborative Filtering

http://eprints.ecs.soton.ac.uk/18483/

Online applications : http://www.convo.co.uk/x02/

http://mahout.apache.org/

EMC Data Science & Big Data Analytics Training Module

https://education.emc.com/guest/campaign/data_science.aspx

SAS Official Predictive Modeling Training Course

https://support.sas.com/edu/schedules.html?id=1366&ctry=us

https://support.sas.com/edu/schedules.html?id=1220&ctry=US

Big Data for Dummies by Judith Hurwitz, Alan NUGENT, Dr. Fern Halper, Marcia Kaufman

ISBN : 978-1-118-50422-2 www.wiley.com

Gartner : http://www.gartner.com/it-glossary/big-data/

The Harvard Business Review :

https://hbr.org/2013/12/you-may-not-need-big-data-after-all/ar/1

MapReduce: Simplified Data Processing on Large Clusters (from Google)

http://static.googleusercontent.com/media/research.google.com/fr//archive/mapreduce-osdi04.pdf

Hadoop Apache Foundation

http://hadoop.apache.org/

TDWI : http://tdwi.org/

Page 28: Big Data : a 360° Overview

About Me

• I am a freelance consultant who helps organisations leverage their data to improve their performance through the right tool, the right methodology and the right technology. I have over 3 years of experience and 5 certifications. I am a certified SAS Professional and also a certified EMC² Data Scientist.

Data → Information → Knowledge → Actionable plans → Performance

Contact

Mail : [email protected]

Twitter : @Juvenal_JVC

Linkedin : http://fr.linkedin.com/pub/juv%C3%A9nal-chokogoue/52/965/a8

Page 29: Big Data : a 360° Overview

Thank you for attending. I sincerely hope this module will be helpful to you!

The full version will be available soon!