introduccion_final.pdf

Big Data

Introduction

Santiago González <[email protected]>

Contents

• Why Big Data?

• Characteristics of Big Data

• Big Data Technologies and Tools

• Fundamental Big Data Paradigms

• Data Mining

• Visualization

SLIDE 1

Why Big Data?

SLIDE 2

We are drowning in data but starving for knowledge!

http://www.cultindustries.com/new/html/frame.html

Why Big Data?

• The Model of Generating/Consuming Data has Changed

Old Model: Few companies are generating data, all others are consuming data

New Model: all of us are generating data, and all of us are consuming data

SLIDE 3

Who generates and uses data?

• Social media and networks (all of us are generating data)

• Scientific instruments (collecting all sorts of data)

• Mobile devices (tracking all objects all the time)

• Sensor technology and networks (measuring all kinds of data)

• Progress and innovation are no longer hindered by the ability to collect data

• But by the ability to manage, analyze, summarize, visualize, and discover knowledge from the collected data in a timely manner and in a scalable fashion

SLIDE 4

Evolution

• OLTP: Online Transaction Processing (DBMSs)

• OLAP: Online Analytical Processing (Data Warehousing)

• RTAP: Real-Time Analytics Processing (Big Data Architecture & technology)

SLIDE 5

Big Data

• “Big data refers to the tools, processes and procedures allowing an organization to create, manipulate, and manage very large data sets and storage facilities” (zdnet.com)

• The big deal about big data is the potential for getting more value more quickly from more data, at a lower cost and with greater agility. (Brian Hopkins, zdnet)

SLIDE 6

Big Data

“Big Data” is data whose scale, diversity, and complexity require new architecture, techniques, algorithms, and analytics to manage it and extract value and hidden knowledge from it…

SLIDE 7

Characteristics of Big Data

SLIDE 8

Characteristics of Big Data: Volume

• Data volume

– 44x increase from 2009 to 2020

– From 0.8 zettabytes to 35 zettabytes

• Data volume is increasing exponentially

Exponential increase in collected/generated data
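
As a quick check on the figures above, the overall multiplier and the yearly growth it implies can be worked out directly. This is only a sketch: it assumes steady compound growth between 2009 and 2020, which the slide does not state, and the variable names are illustrative.

```python
# Figures quoted on the slide: 0.8 ZB in 2009 and 35 ZB in 2020.
start_zb = 0.8
end_zb = 35.0
years = 2020 - 2009

# Overall multiplier between the two endpoints (the slide rounds this to 44x).
growth_factor = end_zb / start_zb

# Assumption: constant compound growth, so factor = (1 + rate) ** years.
annual_rate = growth_factor ** (1 / years) - 1

print(f"Overall growth: {growth_factor:.2f}x over {years} years")
print(f"Implied annual growth: {annual_rate:.1%}")
```

Under that assumption the volume grows by roughly 41% per year, which is what "exponential" means here: a constant percentage increase, not a constant absolute one.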

SLIDE 9

Variety

• Various formats, types, and structures

• Text, numerical, images, audio, video, sequences, time series, social media data, multi-dimensional arrays, etc.

• Static data vs. streaming data

• A single application can be generating/collecting many types of data

To extract knowledge, all these types of data need to be linked together

SLIDE 10


Velocity

• Data is being generated fast and needs to be processed fast

• Online data analytics

• Late decisions mean missed opportunities

• Examples

– E-promotions: based on your current location, your purchase history, and what you like, send promotions right now for the store next to you

– Healthcare monitoring: sensors monitor your activities and body; any abnormal measurement requires immediate reaction
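
The healthcare monitoring example can be sketched as a small online check that inspects each reading as it arrives instead of waiting for a batch. This is a hypothetical illustration, not a method from the slides: the function name `detect_anomalies`, the window size, the threshold, and the heart-rate values are all assumptions.

```python
from collections import deque
from statistics import mean, stdev


def detect_anomalies(readings, window=5, threshold=3.0):
    """Yield (index, value) for readings far from the recent rolling mean."""
    recent = deque(maxlen=window)  # keeps only the last `window` normal readings
    for i, value in enumerate(readings):
        if len(recent) == window:
            mu = mean(recent)
            sigma = stdev(recent)
            if sigma > 0 and abs(value - mu) > threshold * sigma:
                yield i, value
                continue  # keep the outlier out of the baseline
        recent.append(value)


# Hypothetical heart-rate stream (beats per minute); the values are made up.
heart_rate = [72, 74, 71, 73, 72, 75, 140, 73, 72]
alerts = list(detect_anomalies(heart_rate))
print(alerts)
```

Because the function is a generator, an alert is produced as soon as the abnormal reading is seen, which is the point of the velocity requirement.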

SLIDE 11

Big Data: 3V’s

SLIDE 12

Even 4V’s!

SLIDE 13

Big Data Bubble?

[Figure: Gartner Hype Cycle, Big Data. © 2013 KDnuggets]

Gartner VP says Big Data is falling into the Trough of Disillusionment, Jan 2013

SLIDE 14

http://blogs.gartner.com/svetlana-sicular/big-data-is-falling-into-the-trough-of-disillusionment/



Challenges

• The bottleneck is in technology

– New architecture, algorithms, techniques are needed

• Also in technical skills

– Experts in using the new technology and dealing with big data

SLIDE 15

Big Data Technologies and Tools

SLIDE 16

Architecture

SLIDE 18

Fundamental Paradigms

• MapReduce
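
In the MapReduce model, a map step emits key/value pairs, the pairs are grouped by key, and a reduce step aggregates each group. The classic word count below is a minimal single-process sketch of that idea. It is not how a real framework distributes the work, and the function names and sample documents are illustrative assumptions.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


# Hypothetical input; in a cluster each document could be mapped on a different node.
documents = ["big data big value", "data mining finds patterns in data"]

mapped = [pair for doc in documents for pair in map_phase(doc)]
counts = dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())
print(counts)
```

Since each call to `map_phase` and `reduce_phase` depends only on its own input, the calls could run in parallel on separate machines, which is what makes the paradigm suitable for large data volumes.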

SLIDE 19

Fundamental Paradigms

• CAP Theorem

SLIDE 20

Business Intelligence

• Statistics

• Data mining

• Knowledge Discovery in Data (KDD)

• Predictive Analytics

• Business Analytics

• Data Science

• Data Analytics

• …

Same Core Idea:

Finding Useful Patterns in Data

Different Emphasis

SLIDE 21

Data Mining

SLIDE 22

• Lots of data is being collected and warehoused

– Web data, e-commerce

– Purchases at department/grocery stores

– Bank/credit card transactions

• Computers have become cheaper and more powerful

• Competitive pressure is strong

– Provide better, customized services for an edge (e.g. in Customer Relationship Management)

SLIDE 23

Why?

• Data collected and stored at enormous speeds (GB/hour)

– Remote sensors on a satellite

– Telescopes scanning the skies

– Microarrays generating gene expression data

– Scientific simulations generating terabytes of data

• Traditional techniques infeasible for raw data

• Data mining may help scientists

– In classifying and segmenting data

– In hypothesis formation

Why?

SLIDE 24

What is it?

– Non-trivial extraction of implicit, previously unknown and potentially useful information from data

– Exploration and analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns

SLIDE 25

Origins

• Draws ideas from machine learning/AI, pattern recognition, statistics, and database systems

• Traditional techniques may be unsuitable due to

– Enormity of data

– High dimensionality of data

– Heterogeneous, distributed nature of data

[Figure: Data Mining at the overlap of Machine Learning/Pattern Recognition, Statistics/AI, and Database systems]

SLIDE 26

CRISP-DM

• Why should there be a standard process?

– The data mining process must be reliable and repeatable by people with little data mining background.

SLIDE 27

CRISP-DM

• Why should there be a standard process?

– Allows projects to be replicated

– Aid to project planning and management

– Allows the scalability of new algorithms

SLIDE 28

CRoss-Industry Standard Process for Data Mining

“The CRISP-DM Model: The New Blueprint for Data Mining”, Colin Shearer, Journal of Data Warehousing, Volume 5, Number 4, p. 13-22, 2000

SLIDE 29

CRISP-DM

SLIDE 30

CRISP-DM

• Business Understanding:

– Project objectives and requirements understanding, data mining problem definition

• Data Understanding:

– Initial data collection and familiarization, data quality problems identification

• Data Preparation:

– Table, record and attribute selection, data transformation and cleaning

• Modeling:

– Modeling techniques selection and application, parameters calibration

• Evaluation:

– Business objectives and issues achievement evaluation

• Deployment:

– Result model deployment, repeatable data mining process implementation
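
The middle phases above can be sketched as a tiny pipeline. This is a hypothetical illustration only: the churn data, the function names, and the nearest-class-mean model are assumptions chosen for brevity. Business Understanding and Deployment are organizational steps and appear only as comments.

```python
from statistics import mean

# Business understanding (assumed goal): predict which customers will churn.
# Data understanding: hypothetical (monthly_spend, churned) records;
# None marks a data quality problem found while exploring the data.
raw = [(120, 0), (95, 0), (None, 1), (30, 1), (110, 0), (25, 1), (40, 1), (105, 0)]


def prepare(records):
    """Data preparation: drop records with a missing attribute value."""
    return [(x, y) for x, y in records if x is not None]


def build_model(train):
    """Modeling: assign each record to the class with the nearest mean."""
    mean_0 = mean(x for x, y in train if y == 0)
    mean_1 = mean(x for x, y in train if y == 1)
    return lambda x: 0 if abs(x - mean_0) <= abs(x - mean_1) else 1


def evaluate(model, test):
    """Evaluation: fraction of held-out records predicted correctly."""
    return sum(model(x) == y for x, y in test) / len(test)


clean = prepare(raw)
train, test = clean[:5], clean[5:]  # simple hold-out split
model = build_model(train)
accuracy = evaluate(model, test)
print(f"Held-out accuracy: {accuracy:.2f}")
# Deployment would wrap `model` in a repeatable, monitored process.
```

Keeping each phase in its own function mirrors the CRISP-DM idea that the process should be repeatable: the same steps can be rerun when new data arrives.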

SLIDE 31

CRISP-DM

[Figure: CRISP-DM phases and their tasks]

• Business Understanding: Determine Business Objectives, Assess Situation, Determine Data Mining Goals, Produce Project Plan

• Data Understanding: Collect Initial Data, Describe Data, Explore Data, Verify Data Quality

• Data Preparation: Select Data, Clean Data, Construct Data, Integrate Data, Format Data

• Modeling: Select Modeling Technique, Generate Test Design, Build Model, Assess Model

• Evaluation: Evaluate Results, Review Process, Determine Next Steps

• Deployment: Plan Deployment, Plan Monitoring & Maintenance, Produce Final Report, Review Project

SLIDE 32

CRISP-DM

• Business Understanding and Data Understanding

SLIDE 33

CRISP-DM

• Knowledge acquisition techniques

Knowledge Acquisition, Representation, and Reasoning. Turban, Aronson, and Liang, Decision Support Systems and Intelligent Systems, 7th Edition, Prentice Hall, 2005

SLIDE 34

DM Tools

• Open Source

– Weka

– Orange

– R-Project

– KNIME

• Commercial

– SPSS

– Clementine

– SAS Miner

– Matlab

– …

SLIDE 35

DM Tools

• Weka 3.6

– Java

– Excellent library, average interface

– http://www.cs.waikato.ac.nz/ml/weka/

• Orange

• R-Project

• KNIME

SLIDE 36

DM Tools

• Weka 3.6

• Orange

– C++ and Python

– Average library, good interface

– http://orange.biolab.si/

• R-Project

• KNIME

SLIDE 37

DM Tools

• Weka 3.6

• Orange

• R-Project

– Similar to Matlab and Maple

– Powerful libraries, average interface. Too slow for file access!

– http://cran.es.r-project.org/

• KNIME

SLIDE 38

DM Tools

• Weka 3.6

• Orange

• R-Project

• KNIME

– Java

– Includes Weka, Python and R-Project

– Powerful libraries, good interface

– http://www.knime.org/download-desktop

SLIDE 39

DM Tools

• Let’s install KNIME!

SLIDE 40

Visualization

SLIDE 41

Visualization

SLIDE 42
