Big Data: Introduction. Santiago González <[email protected]>
Page 1: introduccion_final.pdf

Big Data

Introduction

Santiago González <[email protected]>

Page 2: introduccion_final.pdf

Contents

• Why Big Data?

• Characteristics of Big Data

• Big Data Technologies and Tools

• Fundamental Big Data Paradigms

• Data Mining

• Visualization

DIAPOSITIVA 1

Page 3: introduccion_final.pdf

Why Big Data?

DIAPOSITIVA 2

We are drowning in data but starving for knowledge!

Page 4: introduccion_final.pdf

Why Big Data?

• The Model of Generating/Consuming Data has Changed

Old Model: Few companies are generating data, all others are consuming data

New Model: all of us are generating data, and all of us are consuming data


DIAPOSITIVA 3

Page 5: introduccion_final.pdf

Who generates and uses data?

• Social media and networks (all of us are generating data)

• Scientific instruments (collecting all sorts of data)

• Mobile devices (tracking all objects all the time)

• Sensor technology and networks (measuring all kinds of data)

• Progress and innovation are no longer hindered by the ability to collect data

• They are hindered by the ability to manage, analyze, summarize, visualize, and discover knowledge from the collected data in a timely and scalable manner

DIAPOSITIVA 4

Page 6: introduccion_final.pdf

Evolution

• OLTP: Online Transaction Processing (DBMSs)

• OLAP: Online Analytical Processing (Data Warehousing)

• RTAP: Real-Time Analytics Processing (Big Data Architecture & technology)
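The difference between the first two styles can be sketched with a small example. This is only an illustration under stated assumptions: SQLite stands in for both a transactional DBMS and a data warehouse, and the `sales` table, its columns, and the store names are hypothetical.

```python
import sqlite3

# In-memory database used purely for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (id INTEGER PRIMARY KEY, store TEXT, amount REAL)")

# OLTP-style work: many small transactions, each touching a few rows.
with conn:  # commits on success, rolls back on error
    conn.execute("INSERT INTO sales (store, amount) VALUES (?, ?)", ("north", 10.0))
    conn.execute("INSERT INTO sales (store, amount) VALUES (?, ?)", ("south", 25.0))
    conn.execute("INSERT INTO sales (store, amount) VALUES (?, ?)", ("north", 5.0))

# OLAP-style work: one query that scans and aggregates many rows.
totals = dict(
    conn.execute("SELECT store, SUM(amount) FROM sales GROUP BY store").fetchall()
)
print(totals)
```

RTAP would apply the same kind of aggregation continuously, as records arrive, rather than over data that has already been stored.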

DIAPOSITIVA 5

Page 7: introduccion_final.pdf

Big Data

• “Big data refers to the tools, processes and procedures allowing an organization to create, manipulate, and manage very large data sets and storage facilities” (zdnet.com)

• The big deal about big data is the potential for getting more value more quickly from more data, at a lower cost and with greater agility. (Brian Hopkins, zdnet)

DIAPOSITIVA 6

Page 8: introduccion_final.pdf

Big Data

“Big Data” is data whose scale, diversity, and complexity require new architecture, techniques, algorithms, and analytics to manage it and extract value and hidden knowledge from it…

DIAPOSITIVA 7

Page 9: introduccion_final.pdf

Characteristics of Big Data

DIAPOSITIVA 8

Page 10: introduccion_final.pdf

Characteristics of Big Data: Volume

• Data volume

– 44x increase from 2009 to 2020

– From 0.8 zettabytes to 35 zettabytes

• Data volume is increasing exponentially

Exponential increase in collected/generated data
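The two figures on this slide can be checked against each other with a short calculation. The 0.8 ZB and 35 ZB values come from the slide. The constant yearly growth rate is an assumption made here for illustration, not something the slide states.

```python
# Figures quoted on the slide: 0.8 ZB in 2009, 35 ZB in 2020.
start_zb, end_zb = 0.8, 35.0
start_year, end_year = 2009, 2020

growth_factor = end_zb / start_zb             # overall increase over the period
years = end_year - start_year
annual_factor = growth_factor ** (1 / years)  # assumes a constant yearly rate

print(f"overall growth: {growth_factor:.2f}x")
print(f"implied yearly growth: {(annual_factor - 1) * 100:.1f}%")
```

The ratio works out to 43.75, which matches the rounded "44x" claim. Under the constant-rate assumption this corresponds to roughly 40% growth per year.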

DIAPOSITIVA 9

Page 11: introduccion_final.pdf

Characteristics of Big Data: Variety

• Various formats, types, and structures

• Text, numerical, images, audio, video, sequences, time series, social media data, multi-dimensional arrays, etc.

• Static data vs. streaming data

• A single application can be generating/collecting many types of data

To extract knowledge, all these types of data need to be linked together

DIAPOSITIVA 10

Page 12: introduccion_final.pdf

Characteristics of Big Data: Velocity

• Data is being generated fast and needs to be processed fast

• Online data analytics

• Late decisions mean missed opportunities

• Examples

– E-promotions: based on your current location, your purchase history, and what you like, send promotions right now for the store next to you

– Healthcare monitoring: sensors monitor your activities and body; any abnormal measurements require immediate reaction
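The healthcare example can be sketched as a check that runs on each reading as it arrives, instead of on a stored batch. This is a minimal illustration, not a clinical method: the function name `detect_anomalies`, the window size, the 3-sigma threshold, and the heart-rate values are all assumptions.

```python
from collections import deque
from statistics import mean, stdev


def detect_anomalies(readings, window=5, threshold=3.0):
    """Yield (index, value) for readings far from the recent window's mean."""
    recent = deque(maxlen=window)  # only the last `window` normal readings are kept
    for i, value in enumerate(readings):
        if len(recent) == window:
            mu, sigma = mean(recent), stdev(recent)
            if sigma > 0 and abs(value - mu) > threshold * sigma:
                yield i, value
                continue  # keep the outlier out of the baseline
        recent.append(value)


# Hypothetical heart-rate stream with one abnormal measurement.
heart_rate = [72, 74, 71, 73, 72, 75, 140, 73, 72]
alerts = list(detect_anomalies(heart_rate))
print(alerts)
```

Because the function is a generator that keeps only a small window in memory, each alert is available as soon as the abnormal reading is seen.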

DIAPOSITIVA 11

Page 13: introduccion_final.pdf

Big Data: 3V’s

DIAPOSITIVA 12

Page 14: introduccion_final.pdf

Even 4 V’s!

DIAPOSITIVA 13

Page 16: introduccion_final.pdf

Challenges

• The bottleneck is in technology

– New architecture, algorithms, and techniques are needed

• Also in technical skills

– Experts in using the new technology and dealing with big data

DIAPOSITIVA 15

Page 17: introduccion_final.pdf

Big Data Technologies and Tools

DIAPOSITIVA 16

Page 18: introduccion_final.pdf
Page 19: introduccion_final.pdf

Architecture

DIAPOSITIVA 18

Page 20: introduccion_final.pdf

Fundamental Paradigms

• MapReduce
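The slide only names the paradigm, so the following is a minimal single-machine sketch of the idea using the usual word-count example. The function names `map_phase`, `shuffle`, and `reduce_phase` and the sample documents are assumptions for illustration. A real framework would run the map and reduce steps in parallel across many machines.

```python
from collections import defaultdict
from itertools import chain


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


# Hypothetical input: each string stands for one document.
documents = ["big data big value", "data about data"]
mapped = chain.from_iterable(map_phase(d) for d in documents)
counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
print(counts)
```

Each call to `map_phase` depends on one document only, and each call to `reduce_phase` on one key only, which is what lets the work be distributed.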

DIAPOSITIVA 19

Page 21: introduccion_final.pdf

Fundamental Paradigms

• CAP theorem
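The CAP theorem says that a distributed data store cannot guarantee consistency, availability, and partition tolerance all at once. When the network partitions, it has to give up either consistency or availability. The toy model below illustrates that trade-off with two replicas. The `Replica` class, the `write` function, and the "CP"/"AP" mode labels are simplifications assumed here, not a real database API.

```python
class Replica:
    """One copy of a single stored value."""

    def __init__(self):
        self.value = None


def write(value, local, remote, partitioned, mode):
    """Write to `local` and replicate to `remote` unless the link is down.

    mode "CP": refuse the write during a partition (consistent, not available).
    mode "AP": accept it locally during a partition (available, may diverge).
    """
    if partitioned:
        if mode == "CP":
            return False     # request rejected: availability is given up
        local.value = value  # replicas now disagree: consistency is given up
        return True
    local.value = remote.value = value
    return True


a, b = Replica(), Replica()
write("v1", a, b, partitioned=False, mode="CP")  # healthy network: both agree

cp_ok = write("v2", a, b, partitioned=True, mode="CP")
cp_consistent = a.value == b.value

ap_ok = write("v2", a, b, partitioned=True, mode="AP")
ap_consistent = a.value == b.value

print(cp_ok, cp_consistent, ap_ok, ap_consistent)
```

In the CP case the write fails but both replicas still hold the same value. In the AP case the write succeeds but the replicas diverge until the partition heals.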

DIAPOSITIVA 20

Page 22: introduccion_final.pdf

Business Intelligence

• Statistics

• Data mining

• Knowledge Discovery in Data (KDD)

• Predictive Analytics

• Business Analytics

• Data Science

• Data Analytics

• …

Same Core Idea:

Finding Useful Patterns in Data

Different Emphasis

DIAPOSITIVA 21

Page 23: introduccion_final.pdf

Data Mining

DIAPOSITIVA 22

Page 24: introduccion_final.pdf

Why?

• Lots of data is being collected and warehoused

– Web data, e-commerce

– Purchases at department/grocery stores

– Bank/credit card transactions

• Computers have become cheaper and more powerful

• Competitive pressure is strong

– Provide better, customized services for an edge (e.g. in Customer Relationship Management)

DIAPOSITIVA 23

Page 25: introduccion_final.pdf

Why?

• Data is collected and stored at enormous speeds (GB/hour)

– Remote sensors on a satellite

– Telescopes scanning the skies

– Microarrays generating gene expression data

– Scientific simulations generating terabytes of data

• Traditional techniques are infeasible for raw data

• Data mining may help scientists

– in classifying and segmenting data

– in hypothesis formation

DIAPOSITIVA 24

Page 26: introduccion_final.pdf

What is it?

– Non-trivial extraction of implicit, previously unknown, and potentially useful information from data

– Exploration and analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns
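One simple instance of discovering patterns by automatic means is counting which items are frequently bought together. The sketch below is an illustration only: the baskets and the `min_support` value are hypothetical, and plain pair counting stands in for the more scalable algorithms used on real data.

```python
from collections import Counter
from itertools import combinations

# Hypothetical grocery baskets; each set is one transaction.
baskets = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"beer", "chips"},
    {"bread", "butter"},
    {"bread", "milk", "chips"},
]

pair_counts = Counter()
for basket in baskets:
    # sorted() makes ("bread", "milk") and ("milk", "bread") the same key
    pair_counts.update(combinations(sorted(basket), 2))

min_support = 3  # a pair must appear in at least this many baskets
frequent_pairs = {pair: n for pair, n in pair_counts.items() if n >= min_support}
print(frequent_pairs)
```

The pattern is not stated anywhere in the raw transactions. It only emerges from analysing them together, which is the sense of "implicit, previously unknown" in the definition.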

DIAPOSITIVA 25

Page 27: introduccion_final.pdf

Origins

• Draws ideas from machine learning/AI, pattern recognition, statistics, and database systems

• Traditional techniques may be unsuitable due to

– Enormity of data

– High dimensionality of data

– Heterogeneous, distributed nature of data

[Figure: Data Mining in relation to Machine Learning/Pattern Recognition, Statistics/AI, and Database systems]

DIAPOSITIVA 26

Page 28: introduccion_final.pdf

CRISP-DM

• Why should there be a standard process?

– The data mining process must be reliable and repeatable by people with little data mining background.

DIAPOSITIVA 27

Page 29: introduccion_final.pdf

CRISP-DM

• Why should there be a standard process?

– Allows projects to be replicated

– Aid to project planning and management

– Allows the scalability of new algorithms

DIAPOSITIVA 28

Page 30: introduccion_final.pdf

CRoss-Industry Standard Process for Data Mining

“The CRISP-DM Model: The New Blueprint for Data Mining”, Colin Shearer, Journal of Data Warehousing, Volume 5, Number 4, p. 13-22, 2000

DIAPOSITIVA 29

Page 31: introduccion_final.pdf

CRISP-DM

DIAPOSITIVA 30

Page 32: introduccion_final.pdf

CRISP-DM

• Business Understanding:

– Project objectives and requirements understanding, data mining problem definition

• Data Understanding:

– Initial data collection and familiarization, data quality problems identification

• Data Preparation:

– Table, record and attribute selection, data transformation and cleaning

• Modeling:

– Modeling techniques selection and application, parameters calibration

• Evaluation:

– Business objectives & issues achievement evaluation

• Deployment:

– Result model deployment, repeatable data mining process implementation
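The middle phases can be sketched on a toy problem. Everything here is a hypothetical illustration: the churn data, the single-threshold "model", and the train/test split are assumptions chosen to keep the example short, not part of CRISP-DM itself. Business Understanding and Deployment appear only as comments because they are not code steps.

```python
# Business Understanding (assumed goal): flag customers likely to churn.

# Hypothetical toy records: (monthly_spend, churned); None marks a missing value.
raw = [(20.0, 1), (25.0, 1), (None, 0), (80.0, 0), (95.0, 0), (30.0, 1), (90.0, 0)]

# Data Understanding: identify data quality problems.
missing = sum(1 for spend, _ in raw if spend is None)

# Data Preparation: record selection and cleaning.
clean = [(spend, label) for spend, label in raw if spend is not None]

# Modeling: calibrate one parameter (a spend threshold) on a training split.
train, test = clean[:4], clean[4:]


def accuracy(threshold, rows):
    """Share of rows where 'spend below threshold' predicts churn correctly."""
    hits = sum(1 for spend, label in rows if (spend < threshold) == bool(label))
    return hits / len(rows)


candidates = [spend for spend, _ in train]
best_threshold = max(candidates, key=lambda t: accuracy(t, train))

# Evaluation: check the calibrated model on held-out records.
test_accuracy = accuracy(best_threshold, test)

# Deployment would wrap best_threshold in a repeatable scoring process.
print(missing, best_threshold, test_accuracy)
```

The held-out set here has only two records, so the resulting score says nothing about real performance. The point is the order of the steps.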

DIAPOSITIVA 31

Page 33: introduccion_final.pdf

CRISP-DM

• Business Understanding

– Determine Business Objectives

– Assess Situation

– Determine Data Mining Goals

– Produce Project Plan

• Data Understanding

– Collect Initial Data

– Describe Data

– Explore Data

– Verify Data Quality

• Data Preparation

– Select Data

– Clean Data

– Construct Data

– Integrate Data

– Format Data

• Modeling

– Select Modeling Technique

– Generate Test Design

– Build Model

– Assess Model

• Evaluation

– Evaluate Results

– Review Process

– Determine Next Steps

• Deployment

– Plan Deployment

– Plan Monitoring & Maintenance

– Produce Final Report

– Review Project

DIAPOSITIVA 32

Page 34: introduccion_final.pdf

CRISP-DM

• Business Understanding and Data Understanding

DIAPOSITIVA 33

Page 35: introduccion_final.pdf

CRISP-DM

• Knowledge acquisition techniques

Knowledge Acquisition, Representation, and Reasoning. Turban, Aronson, and Liang, Decision Support Systems and Intelligent Systems, 7th Edition, Prentice Hall, 2005

DIAPOSITIVA 34

Page 36: introduccion_final.pdf

DM Tools

• Open Source

– Weka

– Orange

– R-Project

– KNIME

• Commercial

– SPSS

– Clementine

– SAS Miner

– Matlab

– …

DIAPOSITIVA 35

Page 37: introduccion_final.pdf

DM Tools

• Weka 3.6

– Java

– Excellent library, average interface

– http://www.cs.waikato.ac.nz/ml/weka/

• Orange

• R-Project

• KNIME

DIAPOSITIVA 36

Page 38: introduccion_final.pdf

DM Tools

• Weka 3.6

• Orange

– C++ and Python

– Average library, good interface

– http://orange.biolab.si/

• R-Project

• KNIME

DIAPOSITIVA 37

Page 39: introduccion_final.pdf

DM Tools

• Weka 3.6

• Orange

• R-Project

– Similar to Matlab and Maple

– Powerful libraries, average interface. Too slow for file access!

– http://cran.es.r-project.org/

• KNIME

DIAPOSITIVA 38

Page 40: introduccion_final.pdf

DM Tools

• Weka 3.6

• Orange

• R-Project

• KNIME

– Java

– Includes Weka, Python and R-Project

– Powerful libraries, good interface

– http://www.knime.org/download-desktop

DIAPOSITIVA 39

Page 41: introduccion_final.pdf

DM Tools

• Let’s install KNIME!

DIAPOSITIVA 40

Page 42: introduccion_final.pdf

Visualization

DIAPOSITIVA 41

Page 43: introduccion_final.pdf

Visualization

DIAPOSITIVA 42

Page 44: introduccion_final.pdf

Big Data

Introduction

Santiago González <[email protected]>