From Data to Decisions: New Strategies for Deploying Analytics Using Clouds Robert Grossman Open Data Group July 29, 2009
Page 1: The Impact of Cloud Computing on Predictive Analytics 7-29-09 v5

From Data to Decisions: New Strategies for Deploying Analytics Using Clouds

Robert Grossman
Open Data Group

July 29, 2009

Overview

Cloud computing has changed analytic infrastructure and enabled new classes of analytic algorithms. It’s time to rethink your analytic strategy.

[Diagram: Analytic Strategy, Analytics, Analytic Infrastructure]

Part 1

Quick Review of Clouds

What is a Cloud?

Clouds provide on-demand resources or services over a network, often the Internet, with the scale and reliability of a data center.

No standard definition.
Cloud architectures are not new.
What is new:
– Scale
– Ease of use
– Pricing model

Scale is new.

Elastic, Usage Based Pricing Is New

1 computer in a rack for 120 hours costs the same as 120 computers in three racks for 1 hour.

Elastic, usage based pricing turns capex into opex. Clouds can manage surges in computing needs.
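
The equivalence on this slide is plain arithmetic: under usage-based pricing, cost depends only on total machine-hours. A minimal sketch, assuming a hypothetical flat rate of $0.10 per machine-hour (the rate and the function name are illustrative and do not come from the slides):

```python
# Hypothetical flat rate per machine-hour; illustrative only, not a quoted price.
RATE_PER_MACHINE_HOUR = 0.10


def usage_cost(machines: int, hours: int, rate: float = RATE_PER_MACHINE_HOUR) -> float:
    """Cost under pure usage-based pricing: pay only for machine-hours consumed."""
    return machines * hours * rate


# 1 computer for 120 hours versus 120 computers for 1 hour.
one_machine_long = usage_cost(machines=1, hours=120)
many_machines_short = usage_cost(machines=120, hours=1)

print(one_machine_long, many_machines_short)
```

Both calls consume 120 machine-hours, so they cost the same; the second simply finishes 120 times sooner, which is what makes surges affordable.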

Simplicity Offered By the Cloud is New

+ .. and you have a computer ready to work.

A new programmer can develop a program to process a container full of data with less than a day of training using MapReduce.

Two Types of Clouds

On-demand resources & services over a network at the scale of a data center

On-demand computing instances (IaaS)
– IaaS: Amazon EC2, S3, etc.; Eucalyptus
– Supports many Web 2.0 applications/users

On-demand cloud services for large data cloud applications (PaaS for large data clouds)
– GFS/MapReduce/Bigtable, Hadoop, Sector, …
– Manage and compute with large data (say 10+ TB)

Cloud Architectures – How Do You Fill a Data Center?

[Diagram: two kinds of cloud, each with apps on top. One provides on-demand computing capacity through layered services: Cloud Storage Services, Cloud Compute Services (MapReduce & Generalizations), Cloud Data Services (BigTable, etc.), and Quasi-relational Data Services. The other provides on-demand computing instances.]

Part 2

What is Analytic Infrastructure ...

… and why you should care.

What is Analytics?

Short definition: Using data to make decisions.

Longer definition: Using data to take actions and make decisions using models that are empirically derived and statistically valid.

It is important to understand the difference between reporting and analytics.

Risk Models

Direct Marketing Models

Online Models

What is the Size of Your Data?

Small
– Fits into memory

Medium
– Too large for memory
– But fits into a database
– N.B. databases are designed for safe writing of rows

Large
– Too large for a database
– But can use a specialized file system (column-wise)
– Or a storage cloud (Google File System, Hadoop DFS)

(Very Simplified) Architectural View

The Predictive Model Markup Language (PMML) is an XML language for statistical and data mining models (www.dmg.org).

With PMML, it is easy to move models between applications and platforms.
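
To make the model producer concrete, here is a minimal sketch that fits a one-variable linear model and writes it out as XML in the style of a PMML RegressionModel. The toy data, the function names, the field names `x` and `y`, and the choice of PMML version 4.0 are all assumptions for illustration; the output is simplified and has not been validated against the DMG schema.

```python
import xml.etree.ElementTree as ET

# Assumed namespace for PMML 4.0; adjust to the version your tools expect.
PMML_NS = "http://www.dmg.org/PMML-4_0"


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def to_pmml(intercept, slope, predictor="x", target="y"):
    """Serialize the fitted line as a simplified PMML-style document."""
    pmml = ET.Element("PMML", version="4.0", xmlns=PMML_NS)
    ET.SubElement(pmml, "Header", description="toy linear model (illustrative)")
    dictionary = ET.SubElement(pmml, "DataDictionary", numberOfFields="2")
    for name in (predictor, target):
        ET.SubElement(dictionary, "DataField", name=name,
                      optype="continuous", dataType="double")
    model = ET.SubElement(pmml, "RegressionModel", functionName="regression")
    schema = ET.SubElement(model, "MiningSchema")
    ET.SubElement(schema, "MiningField", name=predictor)
    ET.SubElement(schema, "MiningField", name=target, usageType="predicted")
    table = ET.SubElement(model, "RegressionTable", intercept=repr(intercept))
    ET.SubElement(table, "NumericPredictor", name=predictor,
                  coefficient=repr(slope))
    return ET.tostring(pmml, encoding="unicode")


# Toy data lying exactly on y = 1 + 2x.
intercept, slope = fit_line([1.0, 2.0, 3.0, 4.0], [3.0, 5.0, 7.0, 9.0])
pmml_text = to_pmml(intercept, slope)
print(pmml_text)
```

The resulting file carries only parameters and field descriptions, so any application that understands the format can consume the model without sharing code with the producer.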

[Diagram: Data → Model Producer → PMML Model]

(Simplified) Architectural View

PMML also supports XML elements to describe data preprocessing.

[Diagram: Data → Data Pre-processing → features → Model Producer (algorithms to estimate models) → PMML Model]

Three Important Interfaces

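[Diagram: Modeling Environment: Data → Data Pre-processing → Model Producer → PMML Model. Deployment Environment: data → Data Pre-processing → Model Consumer (reads the PMML Model) → scores → Post Processing → actions. The three interfaces are numbered 1, 2, and 3 in the diagram.]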

Actually, This is Typically a Component in a Workflow

With the proper analytic infrastructure, cloud computing can be used for data preprocessing, for scoring, for producing models, and as a platform for other services in the analytic infrastructure.

Part 3

Cloud Programming Models for Working With Large Data

Map-Reduce Example

Both input & output are (key, value) pairs.
Input is a file with one document per record.
User specifies map function:
– key = document URL
– value = terms that document contains

(“doc cdickens”, “it was the best of times”) → map → (“it”, 1), (“was”, 1), (“the”, 1), (“best”, 1), …

Example (cont’d)

MapReduce library gathers together all pairs with the same key value (shuffle/sort phase).
The user-defined reduce function combines all the values associated with the same key.

key = “it”, values = 1, 1
key = “was”, values = 1, 1
key = “best”, values = 1
key = “worst”, values = 1

→ reduce →

(“it”, 2), (“was”, 2), (“best”, 1), (“worst”, 1)
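
The map, shuffle/sort, and reduce phases of this word-count example can be simulated in a single process. This is a minimal sketch of the programming model only, not of Hadoop or any other MapReduce library; the function names are hypothetical, and the input sentence is an assumed completion of the slide's example chosen so that the counts match the reduce output shown.

```python
from collections import defaultdict


def map_fn(doc_key, text):
    """Map: emit a (term, 1) pair for every term in one document."""
    for term in text.split():
        yield term, 1


def shuffle(pairs):
    """Shuffle/sort: gather together all values that share the same key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_fn(term, counts):
    """Reduce: combine all values for one key into a single total."""
    return term, sum(counts)


def run_mapreduce(records):
    """Run the three phases in order over (key, value) input records."""
    mapped = (pair for key, text in records for pair in map_fn(key, text))
    grouped = shuffle(mapped)
    return dict(reduce_fn(term, counts) for term, counts in sorted(grouped.items()))


# One document per record; the sentence is an assumed completion of the slide's input.
records = [("doc cdickens", "it was the best of times it was the worst of times")]
word_counts = run_mapreduce(records)
print(word_counts)
```

In a real large data cloud the user supplies only the equivalents of `map_fn` and `reduce_fn`; the library distributes the records, performs the shuffle, and handles failures.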

Part 4

Using Clouds for Scoring (Model Consumers)

What is a Statistical/Data Mining Model?

Infrastructure
– Inputs: data attributes, mining attributes
– Outputs, targets
– Transformations
– Segmented models, ensembles of models

Models that are part of a standard
– Trees, SVMs, neural networks, cluster models, etc.
– In this case, only need to specify parameters

Arbitrary models
– e.g. arbitrary code that takes inputs to outputs

From an Architectural Viewpoint

In an operational environment in which models are being deployed, it may be useful to “Just say no to viewing models as arbitrary code.”

The deployment can be much shorter if a scoring engine reads a PMML file instead of integrating a new piece of code containing a model.
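
A minimal sketch of that idea: a scoring engine whose code stays fixed while the model arrives as data. The class name `RegressionScorer`, the embedded example document, the field names, and the PMML 4.0 namespace are assumptions for illustration; the sketch handles only a single regression table with numeric predictors, which is a small subset of what real PMML scoring engines support.

```python
import xml.etree.ElementTree as ET

# Hypothetical model file contents; in deployment this would be read from disk.
PMML_TEXT = """<PMML version="4.0" xmlns="http://www.dmg.org/PMML-4_0">
  <Header description="hypothetical example model"/>
  <DataDictionary numberOfFields="2">
    <DataField name="x" optype="continuous" dataType="double"/>
    <DataField name="y" optype="continuous" dataType="double"/>
  </DataDictionary>
  <RegressionModel functionName="regression">
    <MiningSchema>
      <MiningField name="x"/>
      <MiningField name="y" usageType="predicted"/>
    </MiningSchema>
    <RegressionTable intercept="1.0">
      <NumericPredictor name="x" coefficient="2.0"/>
    </RegressionTable>
  </RegressionModel>
</PMML>"""

NS = {"p": "http://www.dmg.org/PMML-4_0"}  # assumed PMML 4.0 namespace


class RegressionScorer:
    """Toy scoring engine: loads model parameters from PMML-style XML."""

    def __init__(self, pmml_text):
        root = ET.fromstring(pmml_text)
        table = root.find("./p:RegressionModel/p:RegressionTable", NS)
        if table is None:
            raise ValueError("no RegressionTable found in document")
        self.intercept = float(table.get("intercept"))
        self.coefficients = {
            predictor.get("name"): float(predictor.get("coefficient"))
            for predictor in table.findall("p:NumericPredictor", NS)
        }

    def score(self, record):
        """Score one record given as a dict of field name to numeric value."""
        return self.intercept + sum(
            coefficient * record[name]
            for name, coefficient in self.coefficients.items()
        )


scorer = RegressionScorer(PMML_TEXT)
score = scorer.score({"x": 3.0})
print(score)
```

Deploying a revised model then means replacing the XML file; the engine itself is not rebuilt or re-integrated.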

Model Producers/Consumers in Clouds

Model Consumers take analytic models and use them to score data
– Very easy to deploy in a cloud
– Deploy a scoring engine in a cloud and then simply read PMML files
– Very easy to scale up with cloud surges

Model Producers take data and produce models
– Data parallel applications can be ported to clouds.
– Others require weighing several factors.

[Diagram: Data → Data Pre-processing → Model Producer → PMML Model → Model Consumer → scores → Post Processing → actions]

Modeling can be done in-house.

Scoring engine deployed in a cloud.

Sometimes it makes sense to do the pre-processing in the cloud, especially if the data is there.

Summary

Innovation: Impact

PMML: With PMML, it is easy to move models and preprocessing between apps; supports life cycle management of models.

Scoring engines: Simplifies deployment of models; enables scoring in clouds.

Large data clouds: 1) Can preprocess data to build features on TB size datasets; 2) Can build analytic models on TB size datasets.

For More Information

Contact information: Robert Grossman
blog.rgrossman.com
www.rgrossman.com

www.opendatagroup.com