Spark and the Future of Advanced Analytics
Thomas W. Dinsmore, Consultant and Author
Jan 11, 2017
[Chart: Model Accuracy (AUC vs. sample size, 10,000 to 10,000,000). Source: http://datascience.la/benchmarking-random-forest-implementations/]
Pipeline stages:
• Data Wrangling: structure, select, sample, aggregate
• Feature Engineering: transform
• Model Training
• Prediction: score
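The stages above map naturally onto Spark's ML Pipeline API. A minimal sketch in Scala (the input file, column names, and the choice of LogisticRegression are illustrative assumptions, not from the slides):

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("pipeline-sketch").getOrCreate()

// Data wrangling: structure, select, sample
val raw = spark.read.option("header", "true").option("inferSchema", "true").csv("events.csv")
val sampled = raw.select("category", "amount", "label").sample(withReplacement = false, fraction = 0.1)

// Feature engineering: transform columns into a single feature vector
val indexer = new StringIndexer().setInputCol("category").setOutputCol("categoryIdx")
val assembler = new VectorAssembler()
  .setInputCols(Array("categoryIdx", "amount"))
  .setOutputCol("features")

// Model training
val lr = new LogisticRegression().setLabelCol("label").setFeaturesCol("features")
val model = new Pipeline().setStages(Array(indexer, assembler, lr)).fit(sampled)

// Prediction: score data with the fitted pipeline
val scored = model.transform(raw)
```

Because each stage is a pipeline step, the same fitted object carries the wrangling, feature, and model logic to scoring time.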
[Chart: Model Accuracy (AUC on a holdout sample vs. sample size, 10,000 to 10,000,000), Linear vs. Random Forests. Source: http://datascience.la/benchmarking-random-forest-implementations/]
[Charts: Deep Learning Benchmark, speed of CNTK vs. TensorFlow on 1 GPU, 1 x 4 GPUs, and 2 x 4 GPUs]
• 1 GPU: CNTK is a little faster.
• 4 GPUs: CNTK is a lot faster.
• 2 x 4 GPUs: TensorFlow can't.
The future of analytics is distributed.
• Your data sources and targets are distributed.
  – You may only need a snippet of data.
  – You still have to retrieve that snippet.
• Data movement is expensive.
• Data requirements are expanding.
• Machine learning algorithms can use more data.
• When you need capability, you'd better have it.
Distribution Framework
• Graphical interface to Spark MLlib
• Limited data manipulation functions
• Scoring interface through PMML
Sparkling Water package:
• Enables H2O to work with Spark RDDs and DataFrames
• Publishes Spark data structures as H2O Frames
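A sketch of how that integration looks in Scala, following Sparkling Water's documented pattern (the input path is hypothetical, and exact package paths and signatures vary by release):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.h2o.H2OContext

val spark = SparkSession.builder().appName("sparkling-water-sketch").getOrCreate()

// Start an H2O cloud inside the running Spark cluster
val h2oContext = H2OContext.getOrCreate(spark)

// Publish a Spark DataFrame as an H2O Frame, usable by H2O's algorithms
val df = spark.read.parquet("training.parquet")  // hypothetical input
val h2oFrame = h2oContext.asH2OFrame(df)

// ...train an H2O model on h2oFrame, then publish results back to Spark:
// val resultDf = h2oContext.asDataFrame(h2oResultFrame)
```

The conversion is the whole point: Spark handles wrangling at scale, H2O handles training, and frames move between them without leaving the cluster.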
Summary
• Yes, distributed machine learning is necessary.
• Need a generalized distribution framework.
• Today, Spark is the only game in town.
• Race is on to deliver push-down Spark integration.