Deep Learning with H2O
0xdata, H2O.ai - Scalable In-Memory Machine Learning
Hadoop User Group, Chicago, 7/16/14
Arno Candel

H2O Distributed Deep Learning by Arno Candel 071614


Deep Learning has been dominating recent machine learning competitions with better predictions. Unlike the neural networks of the past, modern Deep Learning methods have cracked the code for training stability and generalization. Deep Learning is not only the leader in image and speech recognition tasks, but is also emerging as the algorithm of choice in traditional business analytics.
This talk introduces Deep Learning and its implementation concepts in the open-source H2O in-memory prediction engine. Designed for the solution of enterprise-scale problems on distributed compute clusters, it offers advanced features such as adaptive learning rate, dropout regularization and optimization for class imbalance. World-record performance on the classic MNIST dataset, best-in-class accuracy on eBay text classification, and other results showcase the power of this game-changing technology. A whole new ecosystem of Intelligent Applications is emerging with Deep Learning at its core.

About the Speaker: Arno Candel

Prior to joining 0xdata as Physicist & Hacker, Arno was a founding Senior MTS at Skytree where he designed and implemented high-performance machine learning algorithms. He has over a decade of experience in HPC with C++/MPI and had access to the world's largest supercomputers as a Staff Scientist at SLAC National Accelerator Laboratory where he participated in US DOE scientific computing initiatives. While at SLAC, he authored the first curvilinear finite-element simulation code for space-charge dominated relativistic free electrons and scaled it to thousands of compute nodes.
He also led a collaboration with CERN to model the electromagnetic performance of CLIC, a ginormous e+e- collider and potential successor of the LHC. Arno has authored dozens of scientific papers and is a sought-after academic conference speaker. He holds a PhD and a Masters summa cum laude in Physics from ETH Zurich.
Transcript
Page 1: H2O Distributed Deep Learning by Arno Candel 071614

Deep Learning with H2O

!

0xdata, H2O.aiScalable In-Memory Machine Learning

!

Hadoop User Group, Chicago, 7/16/14

Arno Candel

Page 2: H2O Distributed Deep Learning by Arno Candel 071614

Who am I?

PhD in Computational Physics, 2005, from ETH Zurich, Switzerland

6 years at SLAC: Accelerator Physics Modeling
2 years at Skytree, Inc: Machine Learning
7 months at 0xdata/H2O: Machine Learning

15 years in HPC, C++, MPI, Supercomputing

@ArnoCandel

Page 3: H2O Distributed Deep Learning by Arno Candel 071614


Outline

Intro & Live Demo (5 mins)

Methods & Implementation (20 mins)

Results & Live Demos (25 mins)

MNIST handwritten digits

Text classification

Weather prediction

Q & A (10 mins)


Page 4: H2O Distributed Deep Learning by Arno Candel 071614


Distributed in-memory math platform ➔ GLM, GBM, RF, K-Means, PCA, Deep Learning

Easy-to-use SDK / API ➔ Java, R, Scala, Python, JSON, browser-based GUI

Businesses can use ALL of their data (with or without Hadoop)

➔ Modeling without Sampling

Big Data + Better Algorithms ➔ Better Predictions

H2O: Open-Source In-Memory Prediction Engine for Big Data


Page 5: H2O Distributed Deep Learning by Arno Candel 071614


About H2O (aka 0xdata)

Pure Java, Apache v2 Open Source. Join the www.h2o.ai/community!


+1 Cyprien Noel for prior work

Page 6: H2O Distributed Deep Learning by Arno Candel 071614


Customer Demands for Practical Machine Learning


Requirement      Value
In-Memory        Fast (Interactive)
Distributed      Big Data (No Sampling)
Open Source      Ownership of Methods
API / SDK        Extensibility

H2O was developed by 0xdata to meet these requirements

Page 7: H2O Distributed Deep Learning by Arno Candel 071614


H2O Integration

[Diagram: H2O runs Standalone, over YARN, or on MRv1, reading data from HDFS; client APIs include Java, R, Scala, Python, and JSON.]

Page 8: H2O Distributed Deep Learning by Arno Candel 071614


H2O Architecture

[Diagram: distributed in-memory K-V store with column compression and memory manager; MapReduce-based Machine Learning algorithms (e.g. Deep Learning); nano-fast scoring engine; R engine; prediction engine.]

Page 9: H2O Distributed Deep Learning by Arno Candel 071614


H2O: The Killer App on Spark

http://databricks.com/blog/2014/06/30/sparkling-water-h20-spark.html

Page 10: H2O Distributed Deep Learning by Arno Candel 071614


John Chambers (creator of the S language and R-core member) named the H2O R API among his top three promising R projects

H2O R CRAN package

Page 11: H2O Distributed Deep Learning by Arno Candel 071614


H2O + R = Happy Data Scientist


Machine Learning on Big Data with R: data resides on the H2O cluster!

Page 12: H2O Distributed Deep Learning by Arno Candel 071614


H2O Deep Learning in Action

Train: 60,000 rows, 784 integer columns, 10 classes
Test: 10,000 rows, 784 integer columns, 10 classes


MNIST = Digitized handwritten digits database (Yann LeCun)

Live Demo: Build an H2O Deep Learning model on the MNIST train/test data

Data: 28x28=784 pixels with (gray-scale) values in 0…255

Yann LeCun: “Yet another advice: don't get fooled by people who claim to have a solution to Artificial General Intelligence. Ask them what error rate they get on MNIST or ImageNet.”
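As a rough sketch of what this demo runs, here is a hypothetical R session (argument names follow the current h2o R package; file paths and the network size are assumptions, and the 2014-era API differed slightly):

```r
# Hypothetical R session for the MNIST demo. Assumes local CSVs with the
# 784 pixel columns first and the digit label in column 785.
library(h2o)
h2o.init(nthreads = -1)                     # start/connect to a local H2O node

train <- h2o.importFile("mnist_train.csv")  # 60,000 rows (assumed path)
test  <- h2o.importFile("mnist_test.csv")   # 10,000 rows (assumed path)
train[, 785] <- as.factor(train[, 785])     # 10 classes: digits 0-9
test[, 785]  <- as.factor(test[, 785])

model <- h2o.deeplearning(x = 1:784, y = 785,
                          training_frame   = train,
                          validation_frame = test,
                          hidden = c(1024, 1024, 2048),  # assumed layer sizes
                          epochs = 10)
h2o.confusionMatrix(model, test)            # per-class test set errors
```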

Page 13: H2O Distributed Deep Learning by Arno Candel 071614


What is Deep Learning?

Wikipedia: Deep learning is a set of algorithms in machine learning that attempt to model high-level abstractions in data by using architectures composed of multiple non-linear transformations.

Example: input data (image) ➔ prediction (who is it?)


Facebook's DeepFace (Yann LeCun) recognizes faces as well as humans do

Page 14: H2O Distributed Deep Learning by Arno Candel 071614


Deep Learning is Trending

[Google Trends chart, 2011-2013: searches for Deep Learning trending upward]

Businesses are using Deep Learning techniques!

Google Brain (Andrew Ng, Jeff Dean & Geoffrey Hinton)
FBI FACE: $1 billion face recognition project
Chinese search giant Baidu hires the man behind the "Google Brain" (Andrew Ng)

Page 15: H2O Distributed Deep Learning by Arno Candel 071614


Deep Learning History (slides by Yann LeCun, now at Facebook)


Deep Learning wins competitions AND makes humans, businesses and machines (cyborgs!?) smarter

Page 16: H2O Distributed Deep Learning by Arno Candel 071614


What is NOT Deep

Linear models are not deep (by definition)
Neural nets with 1 hidden layer are not deep (no feature hierarchy)
SVMs and Kernel methods are not deep (2 layers: kernel + linear)
Classification trees are not deep (operate on original input space)


Page 17: H2O Distributed Deep Learning by Arno Candel 071614


Deep Learning in H2O

1970s multi-layer feed-forward Neural Network (supervised learning with stochastic gradient descent using back-propagation)
+ distributed processing for big data (H2O in-memory MapReduce paradigm on distributed data)
+ multi-threaded speedup (H2O Fork/Join worker threads update the model asynchronously)
+ smart algorithms for accuracy (weight initialization, adaptive learning, momentum, dropout, regularization)
= Top-notch prediction engine!

Page 18: H2O Distributed Deep Learning by Arno Candel 071614


Example Neural Network

A "fully connected" directed graph of neurons; information flows from the input layer to the output layer.

Input layer (3 neurons): age, income, employment
Hidden layer 1 (4 neurons), Hidden layer 2 (3 neurons)
Output layer (2 neurons): married, single
#connections: 3x4, 4x3, 3x2

Page 19: H2O Distributed Deep Learning by Arno Candel 071614


Prediction: Forward Propagation

"Neurons activate each other via weighted sums": the inputs x_i (age, income, employment) feed the hidden layers y_j and z_k, which feed the per-class output probabilities p_l (married, single).

y_j = tanh(sum_i(x_i*u_ij) + b_j)
z_k = tanh(sum_j(y_j*v_jk) + c_k)
p_l = softmax(sum_k(z_k*w_kl) + d_l), with sum_l(p_l) = 1
softmax(x_k) = exp(x_k) / sum_k(exp(x_k))

b_j, c_k, d_l: bias values (independent of inputs)
Activation function: tanh; alternative: x -> max(0,x), the "rectifier"
p_l is a non-linear function of x_i: with enough layers, it can approximate ANY function!
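A minimal base-R sketch of this forward pass (illustration only, not H2O's Java implementation; the layer sizes match the example network and the weights are random):

```r
# Forward propagation through two tanh hidden layers and a softmax output.
softmax <- function(v) exp(v) / sum(exp(v))

forward <- function(x, U, b1, V, b2, W, b3) {
  y <- tanh(U %*% x + b1)             # y_j = tanh(sum_i(x_i*u_ij) + b_j)
  z <- tanh(V %*% y + b2)             # z_k = tanh(sum_j(y_j*v_jk) + c_k)
  as.vector(softmax(W %*% z + b3))    # p_l = softmax(sum_k(z_k*w_kl) + d_l)
}

set.seed(42)
U <- matrix(rnorm(4 * 3), 4, 3); b1 <- rnorm(4)  # 3 inputs -> 4 hidden
V <- matrix(rnorm(3 * 4), 3, 4); b2 <- rnorm(3)  # 4 hidden -> 3 hidden
W <- matrix(rnorm(2 * 3), 2, 3); b3 <- rnorm(2)  # 3 hidden -> 2 classes
p <- forward(c(0.3, -1.2, 0.5), U, b1, V, b2, W, b3)
sum(p)  # 1: a valid per-class probability distribution
```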

Page 20: H2O Distributed Deep Learning by Arno Candel 071614


Data Preparation & Initialization

Neural networks are sensitive to numerical noise and operate best in the linear regime (not saturated).

Automatic standardization of the inputs x_i: mean = 0, stddev = 1.
Categorical variables are "horizontalized" (one-hot encoded), e.g. {full-time, part-time, none, self-employed} -> {0,1,0} = part-time, {0,0,0} = self-employed.

Automatic initialization of weights:
Poor man's initialization: random weights w_kl.
Default (better): uniform distribution in +/- sqrt(6/(#units + #units_previous_layer)).
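A small R sketch of these three steps (H2O does all of this automatically; the layer sizes in the last line are an assumption):

```r
# Standardize a numeric column to mean 0, stddev 1:
standardize <- function(x) (x - mean(x)) / sd(x)

# "Horizontalize" a categorical: the reference level maps to all zeros, as on
# the slide ({0,1,0} = part-time, {0,0,0} = self-employed).
employment <- factor(c("part-time", "self-employed", "full-time"),
                     levels = c("self-employed", "full-time", "part-time", "none"))
model.matrix(~ employment)[, -1]  # columns: full-time, part-time, none

# Default weight initialization: uniform in +/- sqrt(6/(fan_in + fan_out)):
init_weights <- function(fan_in, fan_out) {
  limit <- sqrt(6 / (fan_in + fan_out))
  matrix(runif(fan_in * fan_out, -limit, limit), fan_out, fan_in)
}
W1 <- init_weights(784, 50)  # e.g. 784 inputs -> 50 hidden units
```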

Page 21: H2O Distributed Deep Learning by Arno Candel 071614


Training: Update Weights & Biases

For each training row, we make a prediction and compare it with the actual label (supervised learning), e.g. predicted 0.8 vs. actual 1 for "married", predicted 0.2 vs. actual 0 for "single".

Objective: minimize the prediction error E (MSE or cross-entropy):
Mean Square Error = (0.2^2 + 0.2^2)/2 ("penalize differences per class")
Cross-entropy = -log(0.8) ("strongly penalize non-1-ness")

Stochastic Gradient Descent: update weights and biases via the gradient of the error (computed with back-propagation):
w <- w - rate * ∂E/∂w
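The two objectives from this slide, evaluated on the example prediction in a few lines of R (the learning rate in the SGD step is an arbitrary placeholder):

```r
predicted <- c(married = 0.8, single = 0.2)
actual    <- c(married = 1,   single = 0)

mse <- mean((predicted - actual)^2)   # (0.2^2 + 0.2^2)/2 = 0.04
ce  <- -sum(actual * log(predicted))  # -log(0.8) ~ 0.223

# One SGD step per weight: w <- w - rate * dE/dw
sgd_step <- function(w, grad, rate = 0.005) w - rate * grad
```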

Page 22: H2O Distributed Deep Learning by Arno Candel 071614


Backward Propagation

How do we compute ∂E/∂w_i for the update w_i <- w_i - rate * ∂E/∂w_i?

Naive: for every i, evaluate E twice at (w_1,…,w_i±∆,…,w_N). Slow!

Backprop: compute ∂E/∂w_i via the chain rule, going backwards through
net = sum_i(w_i*x_i) + b, y = activation(net), E = error(y):

∂E/∂w_i = ∂E/∂y * ∂y/∂net * ∂net/∂w_i
        = ∂(error(y))/∂y * ∂(activation(net))/∂net * x_i
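An R sketch for a single tanh neuron with squared error, checking the chain-rule gradient against the slow finite-difference approach described above (the input values are arbitrary):

```r
x <- c(0.3, -1.2, 0.5); w <- c(0.1, 0.2, -0.4); b <- 0.05; target <- 1

E_of <- function(w) {                       # E = error(y), y = activation(net)
  y <- tanh(sum(w * x) + b)
  (y - target)^2
}

# Backprop: dE/dw_i = dE/dy * dy/dnet * dnet/dw_i
net  <- sum(w * x) + b
y    <- tanh(net)
grad <- 2 * (y - target) * (1 - y^2) * x    # d(tanh)/d(net) = 1 - tanh^2

# Naive: two evaluations of E per weight (slow for millions of weights)
delta <- 1e-6
grad_fd <- sapply(seq_along(w), function(i) {
  wp <- w; wp[i] <- wp[i] + delta
  wm <- w; wm[i] <- wm[i] - delta
  (E_of(wp) - E_of(wm)) / (2 * delta)
})
all.equal(grad, grad_fd, tolerance = 1e-5)  # TRUE: same gradient, far cheaper
```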

Page 23: H2O Distributed Deep Learning by Arno Candel 071614


H2O Deep Learning Architecture

Nodes/JVMs communicate synchronously via the H2O atomic in-memory K-V store (one HTTPD per node); worker threads within a node run asynchronously.

Initial model: weights and biases w.
Map: each node trains a copy of the weights and biases with (some* or all of) its local data, using asynchronous Fork/Join threads.
Reduce: model averaging: average the weights and biases from all nodes, e.g. w* = (w1+w2+w3+w4)/4.
The speedup is at least #nodes/log(#rows) (arXiv:1209.4129v3).

Keep iterating over the data ("epochs"), score from time to time, and query & display the model via JSON or the web UI.

*The user can specify the number of total rows per MapReduce iteration.
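A toy R sketch of this map/reduce model-averaging step (not H2O's Java internals; a linear model stands in for the network, and all "nodes" live in one R session):

```r
set.seed(1)
w0 <- c(0, 0, 0)
node_data <- lapply(1:4, function(i) {      # 4 "nodes", each with 10 local rows
  X <- matrix(rnorm(30), 10, 3)
  list(X = X, y = as.vector(X %*% c(1, -2, 0.5)) + rnorm(10, sd = 0.1))
})

local_train <- function(w, X, y, rate = 0.1) {
  resid <- as.vector(X %*% w) - y           # stand-in for one training pass
  w - rate * as.vector(t(X) %*% resid) / nrow(X)
}

# map: each node trains its own copy of the model on its local data
node_weights <- lapply(node_data, function(d) local_train(w0, d$X, d$y))

# reduce: model averaging, w* = (w1 + w2 + w3 + w4)/4
w_star <- Reduce(`+`, node_weights) / length(node_weights)
```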

Page 24: H2O Distributed Deep Learning by Arno Candel 071614


"Secret" Sauce to Higher Accuracy

Adaptive learning rate (ADADELTA, Google): automatically sets the learning rate for each neuron based on its training history.

Grid search and checkpointing: run a grid search to scan many hyper-parameters, then continue training the most promising model(s).

Regularization: L1 penalizes non-zero weights, L2 penalizes large weights, Dropout randomly ignores certain inputs.

Page 25: H2O Distributed Deep Learning by Arno Candel 071614


Detail: Adaptive Learning Rate

Compute the moving average of ∆w_i² at time t for window length rho:
E[∆w_i²]_t = rho * E[∆w_i²]_t-1 + (1-rho) * ∆w_i²

Compute the RMS of ∆w_i at time t with smoothing epsilon:
RMS[∆w_i]_t = sqrt( E[∆w_i²]_t + epsilon )

Do the same for ∂E/∂w_i, then obtain the per-weight learning rate (cf. the ADADELTA paper):
rate(w_i, t) = RMS[∆w_i]_t-1 / RMS[∂E/∂w_i]_t

Adaptive annealing / progress: gradient-dependent learning rate; the moving window prevents "freezing" (unlike ADAGRAD, which has no window).
Adaptive acceleration / momentum: accumulates previous weight updates, but over a window of time.
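A per-weight R sketch of these formulas (cf. the ADADELTA paper; the rho and epsilon defaults are illustrative):

```r
# One ADADELTA step: no global learning rate, only rho and epsilon.
adadelta_step <- function(state, grad, rho = 0.95, epsilon = 1e-6) {
  state$Eg2 <- rho * state$Eg2 + (1 - rho) * grad^2     # E[g^2]_t
  rate <- sqrt(state$Edw2 + epsilon) / sqrt(state$Eg2 + epsilon)
  dw <- -rate * grad                                    # RMS[dw]_{t-1}/RMS[g]_t
  state$Edw2 <- rho * state$Edw2 + (1 - rho) * dw^2     # E[dw^2]_t
  state$w <- state$w + dw
  state
}

state <- list(w = 0.5, Eg2 = 0, Edw2 = 0)
state <- adadelta_step(state, grad = 0.1)  # the rate adapts per weight
```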

Page 26: H2O Distributed Deep Learning by Arno Candel 071614


Detail: Dropout Regularization

Training: for each hidden neuron, for each training sample, for each iteration, ignore (zero out) a different random fraction p of input activations.

[Diagram: the example network with several input activations crossed out]

Testing: use all activations, but reduce them by a factor p (to "simulate" the missing activations during training).

cf. Geoff Hinton's paper
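A minimal R sketch of dropout; following Hinton's formulation, p below is the probability of keeping an activation:

```r
dropout_train <- function(a, p = 0.5) {
  mask <- rbinom(length(a), 1, p)  # independently zero out units with prob 1-p
  a * mask
}
dropout_test <- function(a, p = 0.5) {
  a * p  # use all activations, scaled to match the training-time expectation
}

h <- c(0.7, -0.3, 1.2, 0.1)   # some hidden activations
dropout_train(h)               # different random units zeroed on every call
dropout_test(h)                # all units, reduced by the factor p
```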

Page 27: H2O Distributed Deep Learning by Arno Candel 071614


MNIST: digits classification

Standing world record: without distortions or convolutions, the best-ever published test set error rate is 0.83% (Microsoft)


Time to check in on the demo!

Let’s see how H2O did in the past 20 minutes!

Page 28: H2O Distributed Deep Learning by Arno Candel 071614


H2O Deep Learning on MNIST: 0.87% test set error (so far)

Test set error: 1.5% after 10 mins, 1.0% after 1.5 hours, 0.87% after 4 hours
Running on 4 nodes with 16 cores each

World-class results: no pre-training, no distortions, no convolutions, no unsupervised training!

Frequent errors: confusing 2/7 and 4/9

Page 29: H2O Distributed Deep Learning by Arno Candel 071614


Weather Dataset

Predict “RainTomorrow” from Temperature, Humidity, Wind, Pressure, etc.

Page 30: H2O Distributed Deep Learning by Arno Candel 071614


Live Demo: Weather Prediction

Interactive ROC curve with real-time updates


3 hidden Rectifier layers, Dropout, L1 penalty, 5-fold cross-validation

The 12.7% 5-fold cross-validation error is at least as good as GBM/RF/GLM models
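A hedged sketch of this model in R (the parameter names exist in the current h2o package; the data file, layer sizes, and L1 value are assumptions):

```r
weather <- h2o.importFile("weather.csv")              # assumed path
predictors <- setdiff(names(weather), "RainTomorrow")

model <- h2o.deeplearning(x = predictors, y = "RainTomorrow",
                          training_frame = weather,
                          activation = "RectifierWithDropout",  # Rectifier + Dropout
                          hidden = c(32, 32, 32),               # assumed sizes
                          hidden_dropout_ratios = c(0.5, 0.5, 0.5),
                          l1 = 1e-5,                            # L1 penalty
                          nfolds = 5)                           # 5-fold cross-validation
```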

Page 31: H2O Distributed Deep Learning by Arno Candel 071614


Live Demo: Grid Search

How did I find those parameters? Grid search! (It works for multiple hyper-parameters at once.)


Then continue training the best model
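With the current h2o R package this can be scripted via h2o.grid (the 2014 demo used the browser GUI; the hyper-parameter values below are illustrative assumptions):

```r
grid <- h2o.grid("deeplearning",
                 x = predictors, y = "RainTomorrow",   # as in the sketch above
                 training_frame = weather,
                 hyper_params = list(hidden = list(c(16, 16), c(32, 32, 32)),
                                     l1     = c(0, 1e-5),
                                     epochs = c(1, 10)))

best_id <- h2o.getGrid(grid@grid_id, sort_by = "logloss")@model_ids[[1]]
best    <- h2o.getModel(best_id)

# then continue training the best model from its checkpoint
# (the network structure must stay the same):
more <- h2o.deeplearning(x = predictors, y = "RainTomorrow",
                         training_frame = weather,
                         checkpoint = best_id,
                         hidden = best@parameters$hidden,
                         epochs = 20)
```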

Page 32: H2O Distributed Deep Learning by Arno Candel 071614


Use Case: Text Classification

Goal: Predict the item from seller’s text description


Train: 578,361 rows, 8,647 cols, 467 classes
Test: 64,263 rows, 8,647 cols, 143 classes

“Vintage 18KT gold Rolex 2 Tone in great condition”

Data: binary word vector, e.g. 0,0,1,0,0,0,0,0,1,0,0,0,1,…,0, where the 1s mark the words present ("vintage", "gold", "condition")

Let’s see how H2O does on the eBay dataset!

Page 33: H2O Distributed Deep Learning by Arno Candel 071614


Out-Of-The-Box: 11.6% test set error after 10 epochs! Predicts the correct class (out of 143) 88.4% of the time!


Train: 578,361 rows, 8,647 cols, 467 classes
Test: 64,263 rows, 8,647 cols, 143 classes

Note 1: The H2O columnar-compressed in-memory store needs only 60 MB to store 5 billion values (a dense CSV needs 18 GB)
Note 2: No tuning was done (results are for illustration only)

Use Case: Text Classification

Page 34: H2O Distributed Deep Learning by Arno Candel 071614


Parallel Scalability (for 64 epochs on MNIST, with “0.87%” parameters)


[Charts: speedup and training time (in minutes) vs. number of H2O nodes (1, 2, 4, 8, 16, 32, 63; 4 cores per node, 1 epoch per node per MapReduce); "2.7 mins" is marked on the training-time plot.]

Page 35: H2O Distributed Deep Learning by Arno Candel 071614


Tips for H2O Deep Learning

General:
- More layers for more complex functions (exponentially more non-linearity).
- More neurons per layer to detect finer structure in the data ("memorizing").
- Add some regularization for less overfitting (smaller validation error).
- Do a grid search to get a feel for convergence, then continue training.
- Try Tanh first, then Rectifier; try max_w2 = 50 and/or L1 = 1e-5.
- Try Dropout (input: 20%, hidden: 50%) with a test/validation set after finding good parameters for convergence on the training set.

Distributed:
- More training samples per iteration: faster, but less accuracy?
- With ADADELTA: try epsilon = 1e-4, 1e-6, 1e-8, 1e-10 and rho = 0.9, 0.95, 0.99.
- Without ADADELTA: try rate = 1e-4…1e-2, rate_annealing = 1e-5…1e-8, momentum_start = 0.5, momentum_stable = 0.99, momentum_ramp = 1/rate_annealing.
- Try balance_classes = true for imbalanced classes.
- Use force_load_balance and replicate_training_data for small datasets.
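A hedged R call wiring up several of the tips above (these parameter names exist in the h2o package; the values are the slide's suggestions, not a tuned configuration, and the frames/columns are assumptions):

```r
model <- h2o.deeplearning(x = predictors, y = response,
                          training_frame = train, validation_frame = valid,
                          activation = "TanhWithDropout",       # try Tanh first
                          input_dropout_ratio   = 0.2,          # input: 20%
                          hidden_dropout_ratios = c(0.5, 0.5),  # hidden: 50%
                          hidden = c(128, 128),                 # assumed sizes
                          max_w2 = 50, l1 = 1e-5,
                          adaptive_rate = TRUE, rho = 0.95, epsilon = 1e-8,
                          balance_classes = TRUE,               # imbalanced classes
                          force_load_balance = TRUE,            # small datasets
                          replicate_training_data = TRUE)
```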


Page 36: H2O Distributed Deep Learning by Arno Candel 071614


H2O brings Deep Learning to R

All parameters are available from R… and more docs coming soon! (draft shown)

Page 37: H2O Distributed Deep Learning by Arno Candel 071614


POJO Model Export for Production Scoring


Plain old Java code is auto-generated to take your H2O Deep Learning models into production!
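From R, the generated Java can be fetched with h2o.download_pojo (the path is an assumption):

```r
# Writes <model_id>.java, a standalone scoring class, to the given directory.
h2o.download_pojo(model, path = "/tmp")
```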

Page 38: H2O Distributed Deep Learning by Arno Candel 071614


Deep Learning Auto-Encoders for Anomaly Detection


Toy example: find an anomaly in ECG heartbeat data. First, train a model on what's "normal": 20 time-series samples of 210 data points each.

Deep Auto-Encoder: learns the low-dimensional non-linear "structure" of the data that allows it to reconstruct the original data.

Also for categorical data!
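A hedged sketch of this flow in R (autoencoder = TRUE and h2o.anomaly are real h2o features; the file paths, layer sizes, and epochs are assumptions):

```r
ecg <- h2o.importFile("ecg_normal.csv")   # 20 "normal" series x 210 points (assumed)
ae  <- h2o.deeplearning(x = 1:210, training_frame = ecg,
                        autoencoder = TRUE,        # unsupervised reconstruction
                        hidden = c(50, 20, 50),    # low-dimensional bottleneck
                        activation = "Tanh", epochs = 100)

test_ecg <- h2o.importFile("ecg_test.csv")         # includes the anomaly (assumed)
err <- h2o.anomaly(ae, test_ecg)                   # per-row reconstruction MSE
# rows whose error is far above the training errors are flagged as anomalies
```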

Page 39: H2O Distributed Deep Learning by Arno Candel 071614


Deep Learning Auto-Encoders for Anomaly Detection


[Diagram: test set with anomaly + model of what's "normal" => the test set prediction (the reconstruction) looks "normal"; the anomaly is found by its large reconstruction error.]

Page 40: H2O Distributed Deep Learning by Arno Candel 071614


H2O Steam: Scoring Platform


Page 41: H2O Distributed Deep Learning by Arno Candel 071614


H2O Steam: More Coming Soon!


Page 42: H2O Distributed Deep Learning by Arno Candel 071614


Key Take-Aways

H2O is a distributed in-memory data science platform. It was designed for high-performance machine learning applications on big data.

H2O Deep Learning is ready to take your advanced analytics to the next level. Try it on your data!

Join our Community and Meetups!
git clone https://github.com/0xdata/h2o
http://docs.0xdata.com
www.h2o.ai/community
@hexadata
