
BIG DATA ANALYTICS MEANS “IN-DATABASE” ANALYTICS

Jan 19, 2015



TIBCO Spotfire

Presented by: Dr. Bruce Aldridge, Sr. Industry Consultant Hi-Tech Manufacturing, Teradata

TIBCO Spotfire and Teradata: First to Insight, First to Action; Warehousing, Analytics and Visualizations for the High Tech Industry Conference
July 22, 2013 The Four Seasons Hotel Palo Alto, CA
Transcript
Page 1: BIG DATA ANALYTICS MEANS “IN-DATABASE” ANALYTICS

Teradata Proprietary and Confidential

BIG DATA ANALYTICS MEANS “IN-DATABASE” ANALYTICS

Dr. Bruce Aldridge Sr. Industry Consultant Hi-Tech Manufacturing Teradata 760.458.1376 [email protected]


2 7/30/2013 Teradata Confidential

Overview of Topics

• “Big Data” Analytics

> The problems of extreme data

> Key principles for analytic engines

• Analytic Technologies

> Changing from sequential to parallel

> Design for analytics

• Operationalizing Analytics

> Analytic life cycle management

> Visualization / interacting


What is “Big Data”?

Big Data: any information that’s too fast, too large or doesn’t fit what you are using

Data Explosion

> Automation of equipment and business processes

> Sensor integration

> Communication (networks / web)

> Compliance


Using Big Data

• Collecting data and using data are different things

> Data Lakes serve as high volume low cost repositories for collection

> Data may be semi-structured or structured; the conversion frequently happens within the repository

> Large amounts of data may be stored for reporting, compliance or investigations

• Unusual or new events provide learning (Most “big data” will not provide new information or knowledge)


Guidelines for Big Data

• Collecting ≠ learning ≠ using data

> Data stored on the appropriate system for its use

> Data mining and statistical tools for learning

> Model publication (PMML) and monitoring for deployment

> Visualization tools critical for all


Extreme data brings new challenges

• New techniques to limit variables for analysis / modeling

• Emergence of columnar analytics

• Wealth of data results in more variables than responses

$y_m = f(x_1, x_2, x_3, x_4, \ldots, x_n)$ where $n > m$

• Data organization struggles with wide data (>100,000 columns)

Wide table:

Id   V1   V2   V3   V4   V5    V6   V7   V8   V9
AA   1.2  3.1  41   56   ‘a’   9    0.2  ?    ?
AB   0.9  2.7  41   62   ‘a’   8    0.2  1.1  7
BA   1.0  2.9  42   57   ‘b’   9    0.1  1.1  ?

“Pivot” to a long table (only some rows shown):

Id   Col ID   Val
AA   V1       1.2
AA   V3       41
AB   V1       0.9
AB   V8       1.1
AB   V2       2.7

Multiple tables add more rows. A second wide table:

Id   V1   V2   V3   V4   V5    V6   V7   V8   V9
CA   1.2  3.1  41   56   ‘a’   9    0.2  ?    ?
CB   0.9  2.7  41   62   ‘a’   8    0.2  1.1  7
BB   1.0  2.9  42   57   ‘b’   9    0.1  1.1  ?

is appended to the same long table (only some rows shown):

Id   Col ID   Val
AA   V1       1.2
AA   V3       41
AB   V1       0.9
AB   V8       1.1
AB   V2       2.7
CA   V4       56
CB   V3       41
BB   V1       1.0
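
The wide-to-long “pivot” on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the presenter's implementation: the function name `wide_to_long` and the use of `None` for the slide's “?” (missing) cells are assumptions. Missing cells are simply not emitted, which is what keeps the long form compact when the wide table is sparse.

```python
from typing import Dict, List, Optional, Tuple

# Wide rows keyed by Id; None stands in for the "?" cells on the slide (assumption).
WideRow = Dict[str, Optional[object]]


def wide_to_long(table: Dict[str, WideRow]) -> List[Tuple[str, str, object]]:
    """Turn {id: {column: value}} into (id, column, value) triples, skipping missing cells."""
    long_rows = []
    for row_id, columns in table.items():
        for col_id, value in columns.items():
            if value is not None:
                long_rows.append((row_id, col_id, value))
    return long_rows


# A small subset of the slide's columns, for illustration only.
wide = {
    "AA": {"V1": 1.2, "V2": 3.1, "V3": 41, "V8": None, "V9": None},
    "AB": {"V1": 0.9, "V2": 2.7, "V3": 41, "V8": 1.1, "V9": 7},
}

long_rows = wide_to_long(wide)
for row in long_rows:
    print(row)
```

Adding a second wide table only appends more triples to the same list, so the long layout never needs new columns.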


Technology Requirements for “Big Data” Analytics

• Need for large amounts of data storage

• Ability to get at the data (SQL)

• Availability of tools for

> Visualization

> Characterizing, organizing and cleaning data

> Summarizing (descriptive statistics)

> Analyzing (predictive models, data discovery)

> Monitoring & reporting

• Analytic Fault Tolerance (massive systems imply more failures)

• Dynamic growth – ability to add more capability without “starting over” – mixing technologies

• ROI



Analytic Tools

• Faster analytics require a different approach: parallel

> Sequential processing will be limited

> Parallel analytics distribute calculations across multiple nodes, with each node holding the data it needs

> The calculation (distribution) and the collection of results must be managed

• Data is generally stored on multiple nodes, so there is no choice but to bring the analytics to the data.

[Diagram labels: Data; Analytic Modeling Tools; Business Results; Local Data repository; Parallel Analytic Procedures; Simple reporting / management tools; Data]
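
One way to picture bringing the analytics to the data: each node computes a small partial result over its own rows, and only those partials are collected and combined. The sketch below simulates this in a single Python process. The node partitioning, the names `Partial`, `local_summary` and `combine`, and the sample values are all hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass
class Partial:
    """Per-node summary: enough to rebuild the global mean and variance."""
    n: int
    total: float
    total_sq: float


def local_summary(values: Iterable[float]) -> Partial:
    """Runs 'on the node': one pass over the local data only."""
    n, total, total_sq = 0, 0.0, 0.0
    for v in values:
        n += 1
        total += v
        total_sq += v * v
    return Partial(n, total, total_sq)


def combine(partials: List[Partial]) -> Tuple[float, float]:
    """Runs on the coordinator: merges the small partials, never the raw rows."""
    n = sum(p.n for p in partials)
    total = sum(p.total for p in partials)
    total_sq = sum(p.total_sq for p in partials)
    mean = total / n
    variance = total_sq / n - mean * mean  # population variance
    return mean, variance


# Hypothetical data already distributed across three nodes.
nodes = [[1.0, 2.0, 3.0], [4.0, 5.0], [6.0]]
mean, variance = combine([local_summary(rows) for rows in nodes])
print(mean, variance)
```

The sum-of-squares form is used here only because it is short; it can lose precision on very large or poorly scaled data, where a more careful merge formula would be preferable.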


Putting it all together: Analytic Architecture

[Architecture diagram; labels recovered from the slide:]

LANGUAGES | MATH & STATS | DATA MINING | BUSINESS INTELLIGENCE | APPLICATIONS

FLEXIBLE ANALYTIC / DISCOVERY PLATFORM

REPORTING / MONITOR SYSTEM OF RECORD - DATA WAREHOUSE

LOW COST – HIGH CAPACITY PARALLEL DATA LAKE: CAPTURE | STORE | REFINE

Data sources: AUDIO & VIDEO | IMAGES | TEXT | WEB & SOCIAL | MACHINE LOGS | CRM | SCM | ERP

Visualization / exploration

Environment for:

• Low cost, high capacity storage

• High power analytics

• Fault tolerant, high performance reporting

• Exploration / visualization across all areas


Analytics Process

Data Preparation: transform, clean and aggregate data to form a data set suitable for analysis

Monitor / Model Deployment: deploy the statistical model to run routinely, automatically monitoring for control

Data Exploration: explore all data with statistical profiling and visualization

Understand / Model the data: apply mathematical / relational models to test hypotheses about the data

[Diagram labels: Sample Data; Build ADS; Modeling ADS; Production ADS (automated process); SQL In-dbs Function; PMML or UDF Models]


Analytic discovery process

CRISP-DM: Cross Industry Standard Process for Data Mining

• Business / Data understanding

> Defining objectives and requirements of the business

> Data collection and data profiling / characterization

• Data preparation: joins between tables, attribute selection, cleaning, building new values

• Modeling: analytic algorithms applied and parameters adjusted

• Evaluation: results scored according to objectives and requirements

• Deployment: models and parameters put into on-demand or automatic execution on new data


Analysis – The generation of knowledge

Generation of knowledge is iterative and interactive

• An idea related to a problem or observation is formulated

• Data is collected to support or refute the idea (deduction: what kind of data is necessary?)

• Analysis is made on the data to validate or refute (induction)

• Results either support / reject the idea or suggest modifications

Monitoring

• Known analytic models used for prediction / verification

• Adjust / control based on prediction vs. observation

• Business scoring used to prioritize

[Diagram labels: Data (facts, phenomena); Idea (model, hypothesis, theory, conjecture); Monitor / control; Validate; Revise]


Establishing a Robust Environment

Quality Information

• Master Data Management

• Data Profiling / visualization

• Logical / Physical Model

• Data “correction”

• Data Steward / Cleansing processes

Discovery

• Statistical / Data Mining Tools

• Secure access

• Robust Analysis capabilities

• Visualization / understanding

• Clear and significant results

• Flexibility in data and models

Automation and Alerts

• Simple publication of discovery knowledge

• Automated pattern / anomaly detection

• Business scoring for notification and escalation

• Clear communication of results

Visualization and Reports

• Choice of the tools to match needs (e.g. Dashboard vs. Engineering views)

• Timing and need for data refresh

• Reporting on Core or staging

• Consistent use of metrics / results (e.g. analytics in database vs. at the reporting layer)


Analytics – Key Requirements

• Performance

> Parallel processing: true shared-nothing architecture

> Data structure influences analytic performance (by an order of magnitude)

> Management of analytics and data is critical

• Fault tolerance

> More nodes WILL result in more failures

> Analytic fault tolerance is more than database fault tolerance: it is the ability to avoid restarting the analytics

• Different node performance

> Execution in parallel will never be identical; adjust for node differences

> System expansions must be compatible

• Flexible analytics

> Big data analytics combine queries with analytic functions

> Analytic languages are not parallel (in general); need the ability to add / customize new functions
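
The point that more nodes will result in more failures follows from simple probability. As a worked sketch, assume (hypothetically) that each node fails independently with probability p during one analytic run; the chance that at least one of n nodes fails is then 1 - (1 - p)^n. The value p = 0.001 below is illustrative, not a measured figure, and real node failures are not necessarily independent.

```python
def p_any_failure(n_nodes: int, p_node: float) -> float:
    """Probability that at least one of n independent nodes fails."""
    return 1.0 - (1.0 - p_node) ** n_nodes


# Illustrative per-node failure probability (assumption).
for n in (1, 10, 100, 1000):
    print(n, round(p_any_failure(n, 0.001), 4))
```

Under these assumptions a run that is almost always safe on one node becomes more likely than not to hit a failure at around a thousand nodes, which is why restarting the whole analytic on every failure does not scale.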


Analytic Applications

• Existing parallel analytics

> In-database proprietary

> In-database add-ons (Fuzzy Logix, SAS, partial R)

> Hybrid (Aster): database architecture supporting MapReduce functions

• Many existing applications moving to parallel

> SAS: partnered with Teradata for seamless in-database execution of more analytics

> R: partnered with Revolution R for rapid data extraction and execution of some analytics in-database AND in parallel

> Spotfire: execution of aggregation analytics and ability to define in-database analytic functions. Embedded TERR (TIBCO Enterprise Runtime for R)

• Write your own

> MapReduce framework

> User defined functions
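
For the “write your own” route, the map-reduce pattern can be sketched in plain Python: a map step emits a key/value pair per record, a shuffle groups the values by key, and a reduce step aggregates each group. This is a single-process illustration of the pattern only, not the Aster or Teradata API. The record fields (`lot`, `yield_pct`) and the function names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def map_record(record: dict) -> Tuple[str, float]:
    """Map: emit (key, value) for one input record."""
    return record["lot"], record["yield_pct"]


def shuffle(pairs: Iterable[Tuple[str, float]]) -> Dict[str, List[float]]:
    """Shuffle: group all emitted values by key."""
    groups: Dict[str, List[float]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_group(values: List[float]) -> float:
    """Reduce: aggregate one group's values (here, the mean)."""
    return sum(values) / len(values)


# Hypothetical input records.
records = [
    {"lot": "A", "yield_pct": 90.0},
    {"lot": "A", "yield_pct": 94.0},
    {"lot": "B", "yield_pct": 80.0},
]
result = {k: reduce_group(v) for k, v in shuffle(map(map_record, records)).items()}
print(result)
```

In a real parallel system the map and reduce steps would run on the nodes that hold the data; here they run in sequence simply to show the shape of the functions a user would supply.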


Analytic Libraries and Enhancements

Database built in:

• Descriptive Statistics

• Basic data mining models (regression, cluster, trees, PCA)

• User defined functions

Partners

• Revolution R, SAS, Fuzzy Logix, Spotfire, …

Enhancements

• High Speed connections

• “Native” data storage


Dashboard as an Analytic Tool

• The Dashboard becomes a 2-way interface

• User interaction parameterizes and launches new analytics

[Figure labels: Device; Lot; Raw Data; Wafer]


Integration of “Dashboard”

• Reporting / visualization tool with ability to execute custom functions in-database

> Empower all users: ability to publish in-database analytics to users


Monitor Analytics

• Analytic models are generally published as SQL-compatible queries

• Applying models to data involves:

> Gather and format data for the analytic

> Group data into consistent sets

> Screen data

> Apply algorithms

> Evaluate results

• Complete Sequence applied to

> Massive amounts of analyses

> Repetitive / automated analyses

> Scoring / Triage to identify most significant results
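
The sequence above (gather, group, screen, apply, evaluate, then score and triage) can be sketched as a small pipeline. Everything specific here is an illustrative assumption: the reading format, the screening rule (drop missing values), the algorithm (distance of the latest value from the mean of the earlier ones, in standard deviations), and the threshold of 2.0.

```python
from collections import defaultdict
from statistics import mean, pstdev
from typing import Dict, List, Optional, Tuple

Reading = Tuple[str, Optional[float]]  # (unit id, measured value or None)


def group(readings: List[Reading]) -> Dict[str, List[Optional[float]]]:
    """Group data into consistent sets, one per unit."""
    sets: Dict[str, List[Optional[float]]] = defaultdict(list)
    for unit, value in readings:
        sets[unit].append(value)
    return sets


def screen(values: List[Optional[float]]) -> List[float]:
    """Screen data: drop missing values."""
    return [v for v in values if v is not None]


def score(values: List[float]) -> float:
    """Apply algorithm: deviation of the latest value from the earlier ones."""
    if len(values) < 2:
        return 0.0  # not enough history to judge
    history, latest = values[:-1], values[-1]
    spread = pstdev(history)
    return 0.0 if spread == 0 else abs(latest - mean(history)) / spread


def triage(readings: List[Reading], threshold: float = 2.0) -> List[Tuple[str, float]]:
    """Evaluate results and return flagged units, most significant first."""
    scores = {unit: score(screen(vals)) for unit, vals in group(readings).items()}
    flagged = [(u, s) for u, s in scores.items() if s > threshold]
    return sorted(flagged, key=lambda item: item[1], reverse=True)


# Hypothetical gathered readings for two units.
readings = [
    ("U1", 10.0), ("U1", 11.0), ("U1", None), ("U1", 9.0), ("U1", 10.0), ("U1", 20.0),
    ("U2", 5.0), ("U2", 5.5), ("U2", 4.5), ("U2", 5.0), ("U2", 5.2),
]
alerts = triage(readings)
print(alerts)
```

Sorting the flagged units by score is the triage step: when the same sequence runs over a massive number of analyses, only the most significant results need a person's attention.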


An Analytic Monitor approach

[Workflow diagram; labels and steps recovered from the slide:]

Components: Stage Data; Standard ETL; Core EDW Data Model; Group Description Table; Group Instance Table; View for Instances; Views for Data (Group Instance); Work Tables (optional); Result Table; Workflow Status / Control Table; Alert settings; Reporting, BI & Alert management tools

• User direct edit of Group Description Table (very infrequent)

• Creation of Views to evaluate load data (installation / dba level user)

• ETL starts the Analytic Stored Procedure and verifies completion

Analytic procedures

1) Update Group Instance Table with new / changed data

2) Identify new & core data for required calculations

3) Perform calculations with standard or custom libraries

4) Compare results to business rules or statistical tests

5) Update alert and report flags
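
Steps 3 to 5 of the analytic procedure can be illustrated with an in-database sketch. The example uses Python's built-in `sqlite3` module purely as a stand-in for the warehouse. The table names (`group_instance`, `result`), the columns, and the business rule (flag a group whose average exceeds 100) are hypothetical and not taken from the presentation. The calculation and the flag update both run as SQL inside the database, and only the small result table is read back.

```python
import sqlite3

# In-memory database standing in for the warehouse (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE group_instance (group_id TEXT, value REAL);
    CREATE TABLE result (
        group_id TEXT PRIMARY KEY,
        avg_value REAL,
        alert INTEGER DEFAULT 0
    );
    """
)
conn.executemany(
    "INSERT INTO group_instance VALUES (?, ?)",
    [("G1", 95.0), ("G1", 99.0), ("G2", 120.0), ("G2", 110.0)],
)

# Step 3: perform the calculation inside the database.
conn.execute(
    "INSERT INTO result (group_id, avg_value) "
    "SELECT group_id, AVG(value) FROM group_instance GROUP BY group_id"
)

# Steps 4 and 5: compare to the hypothetical business rule and set the alert flag.
conn.execute("UPDATE result SET alert = 1 WHERE avg_value > 100")
conn.commit()

rows = conn.execute(
    "SELECT group_id, avg_value, alert FROM result ORDER BY group_id"
).fetchall()
print(rows)
```

A reporting or BI tool then only needs to read the `result` table and its flags, which is the sense in which powerful in-database analysis allows simple front-end tools.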


Statistical summaries vs……

Powerful in-DB Analysis enables the use of Simple BI Tools


Data Graphs

Powerful in-DB Analysis enables the use of Simple BI Tools

Analysis of Big Data: 76M rows of telemetry (16 of 159 plots/units shown) graphed and stored in-database for evaluation.

[Plots: Engine hrs. vs. date]


Summary

Proprietary Information of Teradata and Infor Corporations ©

• Analytics on “Big Data” will require one or more high performance (parallel) systems connected to an interactive interface

• Support combinations of high volumes (data lakes), high performance and flexible advanced analytics

• Tools for understanding, cleansing, discovery AND monitoring necessary

• Interactive Visualization support across all systems

• Management of analytics and fault control