Spark and the Future of Advanced Analytics
Thomas W. Dinsmore, Consultant and Author
Jan 11, 2017
[Chart: Model Accuracy (AUC vs. sample size, 10,000 to 10,000,000). Source: http://datascience.la/benchmarking-random-forest-implementations/]
Pipeline stages:
• Data Wrangling: structure, select, sample, aggregate
• Feature Engineering: transform
• Model Training
• Prediction: score
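The stages above map naturally onto Spark's ML Pipeline API. A minimal sketch in Scala (the input file, column names, and the choice of LogisticRegression are illustrative assumptions, not from the slides):

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("pipeline-sketch").getOrCreate()

// Data wrangling: structure, select, sample
val raw = spark.read.option("header", "true").option("inferSchema", "true").csv("events.csv")
val sampled = raw.select("category", "amount", "label").sample(withReplacement = false, fraction = 0.1)

// Feature engineering: transform columns into a single feature vector
val indexer = new StringIndexer().setInputCol("category").setOutputCol("categoryIdx")
val assembler = new VectorAssembler()
  .setInputCols(Array("categoryIdx", "amount"))
  .setOutputCol("features")

// Model training
val lr = new LogisticRegression().setLabelCol("label").setFeaturesCol("features")
val model = new Pipeline().setStages(Array(indexer, assembler, lr)).fit(sampled)

// Prediction: score data with the fitted pipeline
val scored = model.transform(raw)
```

Because each stage is a pipeline step, the same fitted object carries the wrangling, feature, and model logic to scoring time.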
[Chart: Model Accuracy (AUC on a holdout sample vs. sample size, 10,000 to 10,000,000), Linear vs. Random Forests. Source: http://datascience.la/benchmarking-random-forest-implementations/]
[Charts: Deep Learning Benchmark, speed of CNTK vs. TensorFlow on 1 GPU, 1 x 4 GPUs, and 2 x 4 GPUs]
• 1 GPU: CNTK is a little faster.
• 4 GPUs: CNTK is a lot faster.
• 2 x 4 GPUs: TensorFlow can't.
The future of analytics is distributed.
• Your data sources and targets are distributed.
  – You may only need a snippet of data.
  – You still have to retrieve that snippet.
• Data movement is expensive.
• Data requirements are expanding.
• Machine learning algorithms can use more data.
• When you need capability, you'd better have it.
Distribution Framework
• Graphical interface to Spark MLlib
• Limited data manipulation functions
• Scoring interface through PMML
Sparkling Water package:
• Enables H2O to work with Spark RDDs and DataFrames
• Publishes Spark data structures as H2O Frames
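A sketch of how that integration looks in Scala, following Sparkling Water's documented pattern (the input path is hypothetical, and exact package paths and signatures vary by release):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.h2o.H2OContext

val spark = SparkSession.builder().appName("sparkling-water-sketch").getOrCreate()

// Start an H2O cloud inside the running Spark cluster
val h2oContext = H2OContext.getOrCreate(spark)

// Publish a Spark DataFrame as an H2O Frame, usable by H2O's algorithms
val df = spark.read.parquet("training.parquet")  // hypothetical input
val h2oFrame = h2oContext.asH2OFrame(df)

// ...train an H2O model on h2oFrame, then publish results back to Spark:
// val resultDf = h2oContext.asDataFrame(h2oResultFrame)
```

The conversion is the whole point: Spark handles wrangling at scale, H2O handles training, and frames move between them without leaving the cluster.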
Summary
• Yes, distributed machine learning is necessary.
• Need a generalized distribution framework.
• Today, Spark is the only game in town.
• Race is on to deliver push-down Spark integration.