
Spark and the Future of Advanced Analytics by Thomas Dinsmore

Jan 11, 2017

Data & Analytics

Spark Summit
Transcript
Page 1: Spark and the Future of Advanced Analytics by Thomas Dinsmore

Spark and the Future of Advanced Analytics

Thomas W. Dinsmore, Consultant and Author

Page 2: Spark and the Future of Advanced Analytics by Thomas Dinsmore

Do we need a distributed platform for machine learning?

Page 3: Spark and the Future of Advanced Analytics by Thomas Dinsmore

Arguments Against

Page 4: Spark and the Future of Advanced Analytics by Thomas Dinsmore

1

Page 5: Spark and the Future of Advanced Analytics by Thomas Dinsmore

1

Page 6: Spark and the Future of Advanced Analytics by Thomas Dinsmore

Small datasets

Large datasets (> 1TB)

1

Page 7: Spark and the Future of Advanced Analytics by Thomas Dinsmore

[Figure: Model Accuracy. AUC (y-axis, 64 to 80) vs. Sample Size (x-axis, 10,000 to 10,000,000).]

Source: http://datascience.la/benchmarking-random-forest-implementations/

2

Page 8: Spark and the Future of Advanced Analytics by Thomas Dinsmore

3

Page 9: Spark and the Future of Advanced Analytics by Thomas Dinsmore

4

Titan A450

4x16=64 Cores

1TB RAM

~$20K

Page 10: Spark and the Future of Advanced Analytics by Thomas Dinsmore

5

Page 11: Spark and the Future of Advanced Analytics by Thomas Dinsmore

Arguments For

Page 12: Spark and the Future of Advanced Analytics by Thomas Dinsmore

1

Page 13: Spark and the Future of Advanced Analytics by Thomas Dinsmore

Data Wrangling

Feature Engineering

Model Training

Prediction

• Structure
• Select
• Sample
• Aggregate

• Transform
• Score

Page 14: Spark and the Future of Advanced Analytics by Thomas Dinsmore

Enterprise Data

Data Wrangling

Feature Engineering

Model Training

Scoring

Page 15: Spark and the Future of Advanced Analytics by Thomas Dinsmore

2

Page 16: Spark and the Future of Advanced Analytics by Thomas Dinsmore

3

Page 17: Spark and the Future of Advanced Analytics by Thomas Dinsmore

4

[Figure: Model Accuracy. AUC (*) (y-axis, 64 to 80) vs. Sample Size (x-axis, 10,000 to 10,000,000); series: Linear, Random Forests.]

(*) Holdout Sample

Source: http://datascience.la/benchmarking-random-forest-implementations/
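The chart reports AUC measured on a holdout sample. As a rough illustration of what that metric means, the sketch below computes AUC as the fraction of (positive, negative) pairs that a model ranks correctly. The function name `auc` and the six labels and scores are made up for illustration; they are not taken from the benchmark.

```python
def auc(labels, scores):
    """AUC as the share of positive/negative pairs ranked correctly (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0      # positive scored above negative
            elif p == n:
                wins += 0.5      # tie counts as half a win
    return wins / (len(pos) * len(neg))


# Hypothetical holdout data: true labels and model scores (illustrative only).
holdout_labels = [1, 0, 1, 1, 0, 0]
holdout_scores = [0.9, 0.3, 0.6, 0.4, 0.5, 0.2]

holdout_auc = auc(holdout_labels, holdout_scores)
print(f"Holdout AUC: {100 * holdout_auc:.1f}")
```

The pairwise loop is quadratic, which is fine for a sketch; at the sample sizes on the chart a sort-based rank formulation would be the practical choice.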

Page 18: Spark and the Future of Advanced Analytics by Thomas Dinsmore

5

Page 19: Spark and the Future of Advanced Analytics by Thomas Dinsmore

vs.

Page 20: Spark and the Future of Advanced Analytics by Thomas Dinsmore

1 GPU: CNTK is a little faster

[Figure: Deep Learning Benchmark. Speed (y-axis, 0 to 12,000) for CNTK and TensorFlow on 1 GPU, 1 x 4 GPUs, and 2 x 4 GPUs.]

Page 21: Spark and the Future of Advanced Analytics by Thomas Dinsmore

4 GPUs: CNTK is a lot faster.

[Figure: Deep Learning Benchmark. Speed (y-axis, 0 to 45,000) for CNTK and TensorFlow on 1 GPU, 1 x 4 GPUs, and 2 x 4 GPUs.]

Page 22: Spark and the Future of Advanced Analytics by Thomas Dinsmore

[Figure: Deep Learning Benchmark. Speed (y-axis, 0 to 80,000) for CNTK and TensorFlow on 1 GPU, 1 x 4 GPUs, and 2 x 4 GPUs.]

2x4 GPUs: TensorFlow can’t.

Page 23: Spark and the Future of Advanced Analytics by Thomas Dinsmore

The future of analytics is distributed.

• Your data sources and targets are distributed.
– You may only need a snippet of data
– You still have to retrieve that snippet
• Data movement is expensive.
• Data requirements are expanding.
• Machine learning algorithms can use more data.
• When you need capability, you’d better have it.
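A back-of-the-envelope sketch of why data movement is expensive: the time to copy a dataset over a network link versus the time to copy only a snippet of it. The 1 TB size echoes the "large datasets" threshold used earlier in the deck; the 1 Gbit/s link speed, the 1 GB snippet size, and the helper name `transfer_seconds` are assumptions chosen for illustration.

```python
def transfer_seconds(size_bytes, link_bits_per_second, efficiency=1.0):
    """Idealized transfer time: payload bits divided by usable link bandwidth."""
    if link_bits_per_second <= 0 or not 0 < efficiency <= 1:
        raise ValueError("link speed must be positive and efficiency in (0, 1]")
    return size_bytes * 8 / (link_bits_per_second * efficiency)


TB = 10**12          # decimal terabyte, in bytes
GB = 10**9           # decimal gigabyte, in bytes
LINK = 1e9           # assumed link speed: 1 Gbit/s

full_copy = transfer_seconds(1 * TB, LINK)   # move the whole dataset
snippet = transfer_seconds(1 * GB, LINK)     # move only a 1 GB extract

print(f"Full 1 TB copy: {full_copy / 3600:.1f} hours")
print(f"1 GB snippet:   {snippet:.0f} seconds")
```

Real transfers add protocol overhead, contention, and serialization cost, so these figures are a lower bound rather than a prediction.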

Page 24: Spark and the Future of Advanced Analytics by Thomas Dinsmore
Page 25: Spark and the Future of Advanced Analytics by Thomas Dinsmore

[Diagram: the label "Distribution Framework" appears five times.]

Page 26: Spark and the Future of Advanced Analytics by Thomas Dinsmore

[Diagram: three "Open Source Tool" labels and one "Distribution Framework" label.]

Page 27: Spark and the Future of Advanced Analytics by Thomas Dinsmore
Page 28: Spark and the Future of Advanced Analytics by Thomas Dinsmore

Data Wrangling

Feature Engineering

Model Training

Scoring

Page 29: Spark and the Future of Advanced Analytics by Thomas Dinsmore
Page 30: Spark and the Future of Advanced Analytics by Thomas Dinsmore

Data Loader for Hadoop

SPSS Analytic Server

Page 31: Spark and the Future of Advanced Analytics by Thomas Dinsmore
Page 32: Spark and the Future of Advanced Analytics by Thomas Dinsmore

• Most functions push down to Spark
• Can embed PySpark, SparkR scripts
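To make "push down" concrete, here is a toy sketch in plain Python, not Spark code. It contrasts pulling every row to the client before computing a mean with computing small partial results where each partition lives and moving only those. The partition contents and the function names `pull_then_compute` and `push_down` are hypothetical.

```python
# Three hypothetical partitions of a numeric column, each living on a different node.
partitions = [[3.0, 5.0, 7.0], [2.0, 4.0], [10.0]]


def pull_then_compute(parts):
    """Move every value to the client, then compute the mean there."""
    moved = [x for part in parts for x in part]   # every row crosses the network
    return sum(moved) / len(moved), len(moved)


def push_down(parts):
    """Compute (sum, count) next to the data; move only one small pair per partition."""
    partials = [(sum(part), len(part)) for part in parts]
    total = sum(s for s, _ in partials)
    count = sum(n for _, n in partials)
    return total / count, len(partials)


mean_pull, rows_moved = pull_then_compute(partitions)
mean_push, partials_moved = push_down(partitions)

print(f"pull: mean={mean_pull:.3f}, values moved={rows_moved}")
print(f"push: mean={mean_push:.3f}, partial results moved={partials_moved}")
```

Both paths return the same mean; the difference is how much data leaves the nodes, which grows with row count in the first case and with partition count in the second.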

Page 33: Spark and the Future of Advanced Analytics by Thomas Dinsmore
Page 34: Spark and the Future of Advanced Analytics by Thomas Dinsmore

• Graphical interface to Spark MLlib
• Limited data manipulation functions
• Scoring interface through PMML

Page 35: Spark and the Future of Advanced Analytics by Thomas Dinsmore

Sparkling Water package:
• Enables H2O to work with Spark RDDs, DataFrames
• Publish Spark data structures as H2O Frames

Page 36: Spark and the Future of Advanced Analytics by Thomas Dinsmore

Data Profiling

Page 37: Spark and the Future of Advanced Analytics by Thomas Dinsmore

Data Profiling

Feature Engineering

Model Training

Page 38: Spark and the Future of Advanced Analytics by Thomas Dinsmore

Data Profiling

Feature Engineering

Model Training

Leaderboard

Page 39: Spark and the Future of Advanced Analytics by Thomas Dinsmore

Data Profiling

Feature Engineering

Model Training

Model Selection

Leaderboard

Page 40: Spark and the Future of Advanced Analytics by Thomas Dinsmore

Data Profiling

Model Deployment

Feature Engineering

Model Training

Model Selection

Page 41: Spark and the Future of Advanced Analytics by Thomas Dinsmore

Summary
• Yes, distributed machine learning is necessary.
• Need generalized distribution framework.
• Today, Spark is the only game in town.
• Race is on to deliver push-down Spark integration.

Page 42: Spark and the Future of Advanced Analytics by Thomas Dinsmore

THANK YOU.

Thomas W. Dinsmore
@thomaswdinsmore

The Big Analytics Blog
www.thomaswdinsmore.com