Deep Learning with H2O
0xdata, H2O.ai - Scalable In-Memory Machine Learning
Hadoop User Group, Chicago, 7/16/14
Arno Candel

H2O Distributed Deep Learning by Arno Candel 071614


Deep Learning has been dominating recent machine learning competitions with better predictions. Unlike the neural networks of the past, modern Deep Learning methods have cracked the code for training stability and generalization. Deep Learning is not only the leader in image and speech recognition tasks, but is also emerging as the algorithm of choice in traditional business analytics.
This talk introduces Deep Learning and its implementation concepts in the open-source H2O in-memory prediction engine. Designed for the solution of enterprise-scale problems on distributed compute clusters, it offers advanced features such as adaptive learning rate, dropout regularization and optimization for class imbalance. World-record performance on the classic MNIST dataset, best-in-class accuracy on eBay text classification, and other results showcase the power of this game-changing technology. A whole new ecosystem of Intelligent Applications is emerging with Deep Learning at its core.

About the Speaker: Arno Candel

Prior to joining 0xdata as Physicist & Hacker, Arno was a founding Senior MTS at Skytree where he designed and implemented high-performance machine learning algorithms. He has over a decade of experience in HPC with C++/MPI and had access to the world's largest supercomputers as a Staff Scientist at SLAC National Accelerator Laboratory where he participated in US DOE scientific computing initiatives. While at SLAC, he authored the first curvilinear finite-element simulation code for space-charge dominated relativistic free electrons and scaled it to thousands of compute nodes.
He also led a collaboration with CERN to model the electromagnetic performance of CLIC, a ginormous e+e- collider and potential successor of the LHC. Arno has authored dozens of scientific papers and is a sought-after academic conference speaker. He holds a PhD and a Masters summa cum laude in Physics from ETH Zurich.
Transcript
Page 1: H2O Distributed Deep Learning by Arno Candel 071614

Deep Learning with H2O

!

0xdata, H2O.aiScalable In-Memory Machine Learning

!

Hadoop User Group, Chicago, 7/16/14

Arno Candel

Page 2: H2O Distributed Deep Learning by Arno Candel 071614

Who am I?

PhD in Computational Physics, 2005, from ETH Zurich, Switzerland

6 years at SLAC: Accelerator Physics Modeling
2 years at Skytree, Inc: Machine Learning
7 months at 0xdata/H2O: Machine Learning

15 years in HPC, C++, MPI, Supercomputing

@ArnoCandel

Page 3: H2O Distributed Deep Learning by Arno Candel 071614


Outline

Intro & Live Demo (5 mins)

Methods & Implementation (20 mins)

Results & Live Demos (25 mins)

MNIST handwritten digits

Text classification

Weather prediction

Q & A (10 mins)


Page 4: H2O Distributed Deep Learning by Arno Candel 071614


Distributed in-memory math platform ➔ GLM, GBM, RF, K-Means, PCA, Deep Learning

Easy-to-use SDK / API ➔ Java, R, Scala, Python, JSON, browser-based GUI

Businesses can use ALL of their data (with or without Hadoop)

➔ Modeling without Sampling

Big Data + Better Algorithms ➔ Better Predictions

H2O: Open-Source In-Memory Prediction Engine for Big Data


Page 5: H2O Distributed Deep Learning by Arno Candel 071614


About H2O (aka 0xdata)

Pure Java, Apache v2 Open Source. Join the www.h2o.ai/community!


+1 Cyprien Noel for prior work

Page 6: H2O Distributed Deep Learning by Arno Candel 071614


Customer Demands for Practical Machine Learning


Requirement      Value
In-Memory        Fast (Interactive)
Distributed      Big Data (No Sampling)
Open Source      Ownership of Methods
API / SDK        Extensibility

H2O was developed by 0xdata to meet these requirements

Page 7: H2O Distributed Deep Learning by Arno Candel 071614


H2O Integration

[Diagram: H2O runs Standalone, over YARN, or on MRv1, reading data from HDFS; client APIs include Java, R, Scala, Python, and JSON.]

Page 8: H2O Distributed Deep Learning by Arno Candel 071614


H2O Architecture

[Diagram: distributed in-memory K-V store with column compression and memory manager; MapReduce-based Machine Learning algorithms (e.g. Deep Learning); nano-fast scoring engine; R engine; prediction engine.]

Page 9: H2O Distributed Deep Learning by Arno Candel 071614


H2O: The Killer App on Spark

http://databricks.com/blog/2014/06/30/sparkling-water-h20-spark.html

Page 10: H2O Distributed Deep Learning by Arno Candel 071614


John Chambers (creator of the S language and R-core member) named the H2O R API among his top three promising R projects

H2O R CRAN package

Page 11: H2O Distributed Deep Learning by Arno Candel 071614


H2O + R = Happy Data Scientist


Machine Learning on Big Data with R: data resides on the H2O cluster!

Page 12: H2O Distributed Deep Learning by Arno Candel 071614


H2O Deep Learning in Action

Train: 60,000 rows, 784 integer columns, 10 classes
Test: 10,000 rows, 784 integer columns, 10 classes


MNIST = Digitized handwritten digits database (Yann LeCun)

Live Demo: Build an H2O Deep Learning model on the MNIST train/test data

Data: 28x28=784 pixels with (gray-scale) values in 0…255

Yann LeCun: “Yet another advice: don't get fooled by people who claim to have a solution to Artificial General Intelligence. Ask them what error rate they get on MNIST or ImageNet.”
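As a rough sketch of what this demo runs, here is a hypothetical R session (argument names follow the current h2o R package; file paths and the network size are assumptions, and the 2014-era API differed slightly):

```r
# Hypothetical R session for the MNIST demo. Assumes local CSVs with the
# 784 pixel columns first and the digit label in column 785.
library(h2o)
h2o.init(nthreads = -1)                     # start/connect to a local H2O node

train <- h2o.importFile("mnist_train.csv")  # 60,000 rows (assumed path)
test  <- h2o.importFile("mnist_test.csv")   # 10,000 rows (assumed path)
train[, 785] <- as.factor(train[, 785])     # 10 classes: digits 0-9
test[, 785]  <- as.factor(test[, 785])

model <- h2o.deeplearning(x = 1:784, y = 785,
                          training_frame   = train,
                          validation_frame = test,
                          hidden = c(1024, 1024, 2048),  # assumed layer sizes
                          epochs = 10)
h2o.confusionMatrix(model, test)            # per-class test set errors
```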

Page 13: H2O Distributed Deep Learning by Arno Candel 071614


What is Deep Learning?

Wikipedia: Deep learning is a set of algorithms in machine learning that attempt to model high-level abstractions in data by using architectures composed of multiple non-linear transformations.

Example: input data (image) ➔ prediction (who is it?)


Facebook's DeepFace (Yann LeCun) recognizes faces as well as humans do

Page 14: H2O Distributed Deep Learning by Arno Candel 071614


Deep Learning is Trending

[Google Trends chart, 2011-2013: searches for Deep Learning trending upward]

Businesses are using Deep Learning techniques!

Google Brain (Andrew Ng, Jeff Dean & Geoffrey Hinton)
FBI FACE: $1 billion face recognition project
Chinese search giant Baidu hires the man behind the "Google Brain" (Andrew Ng)

Page 15: H2O Distributed Deep Learning by Arno Candel 071614


Deep Learning History (slides by Yann LeCun, now at Facebook)


Deep Learning wins competitions AND makes humans, businesses and machines (cyborgs!?) smarter

Page 16: H2O Distributed Deep Learning by Arno Candel 071614


What is NOT Deep

Linear models are not deep (by definition)
Neural nets with 1 hidden layer are not deep (no feature hierarchy)
SVMs and Kernel methods are not deep (2 layers: kernel + linear)
Classification trees are not deep (operate on original input space)


Page 17: H2O Distributed Deep Learning by Arno Candel 071614


Deep Learning in H2O

1970s multi-layer feed-forward Neural Network (supervised learning with stochastic gradient descent using back-propagation)
+ distributed processing for big data (H2O in-memory MapReduce paradigm on distributed data)
+ multi-threaded speedup (H2O Fork/Join worker threads update the model asynchronously)
+ smart algorithms for accuracy (weight initialization, adaptive learning, momentum, dropout, regularization)
= Top-notch prediction engine!

Page 18: H2O Distributed Deep Learning by Arno Candel 071614


Example Neural Network

A "fully connected" directed graph of neurons; information flows from the input layer to the output layer.

Input layer (3 neurons): age, income, employment
Hidden layer 1 (4 neurons), Hidden layer 2 (3 neurons)
Output layer (2 neurons): married, single
#connections: 3x4, 4x3, 3x2

Page 19: H2O Distributed Deep Learning by Arno Candel 071614


Prediction: Forward Propagation

"Neurons activate each other via weighted sums": the inputs x_i (age, income, employment) feed the hidden layers y_j and z_k, which feed the per-class output probabilities p_l (married, single).

y_j = tanh(sum_i(x_i*u_ij) + b_j)
z_k = tanh(sum_j(y_j*v_jk) + c_k)
p_l = softmax(sum_k(z_k*w_kl) + d_l), with sum_l(p_l) = 1
softmax(x_k) = exp(x_k) / sum_k(exp(x_k))

b_j, c_k, d_l: bias values (independent of inputs)
Activation function: tanh; alternative: x -> max(0,x), the "rectifier"
p_l is a non-linear function of x_i: with enough layers, it can approximate ANY function!
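A minimal base-R sketch of this forward pass (illustration only, not H2O's Java implementation; the layer sizes match the example network and the weights are random):

```r
# Forward propagation through two tanh hidden layers and a softmax output.
softmax <- function(v) exp(v) / sum(exp(v))

forward <- function(x, U, b1, V, b2, W, b3) {
  y <- tanh(U %*% x + b1)             # y_j = tanh(sum_i(x_i*u_ij) + b_j)
  z <- tanh(V %*% y + b2)             # z_k = tanh(sum_j(y_j*v_jk) + c_k)
  as.vector(softmax(W %*% z + b3))    # p_l = softmax(sum_k(z_k*w_kl) + d_l)
}

set.seed(42)
U <- matrix(rnorm(4 * 3), 4, 3); b1 <- rnorm(4)  # 3 inputs -> 4 hidden
V <- matrix(rnorm(3 * 4), 3, 4); b2 <- rnorm(3)  # 4 hidden -> 3 hidden
W <- matrix(rnorm(2 * 3), 2, 3); b3 <- rnorm(2)  # 3 hidden -> 2 classes
p <- forward(c(0.3, -1.2, 0.5), U, b1, V, b2, W, b3)
sum(p)  # 1: a valid per-class probability distribution
```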

Page 20: H2O Distributed Deep Learning by Arno Candel 071614


Data Preparation & Initialization

Neural networks are sensitive to numerical noise and operate best in the linear regime (not saturated).

Automatic standardization of the inputs x_i: mean = 0, stddev = 1.
Categorical variables are "horizontalized" (one-hot encoded), e.g. {full-time, part-time, none, self-employed} -> {0,1,0} = part-time, {0,0,0} = self-employed.

Automatic initialization of weights:
Poor man's initialization: random weights w_kl.
Default (better): uniform distribution in +/- sqrt(6/(#units + #units_previous_layer)).
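A small R sketch of these three steps (H2O does all of this automatically; the layer sizes in the last line are an assumption):

```r
# Standardize a numeric column to mean 0, stddev 1:
standardize <- function(x) (x - mean(x)) / sd(x)

# "Horizontalize" a categorical: the reference level maps to all zeros, as on
# the slide ({0,1,0} = part-time, {0,0,0} = self-employed).
employment <- factor(c("part-time", "self-employed", "full-time"),
                     levels = c("self-employed", "full-time", "part-time", "none"))
model.matrix(~ employment)[, -1]  # columns: full-time, part-time, none

# Default weight initialization: uniform in +/- sqrt(6/(fan_in + fan_out)):
init_weights <- function(fan_in, fan_out) {
  limit <- sqrt(6 / (fan_in + fan_out))
  matrix(runif(fan_in * fan_out, -limit, limit), fan_out, fan_in)
}
W1 <- init_weights(784, 50)  # e.g. 784 inputs -> 50 hidden units
```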

Page 21: H2O Distributed Deep Learning by Arno Candel 071614


Training: Update Weights & Biases

For each training row, we make a prediction and compare it with the actual label (supervised learning), e.g. predicted 0.8 vs. actual 1 for "married", predicted 0.2 vs. actual 0 for "single".

Objective: minimize the prediction error E (MSE or cross-entropy):
Mean Square Error = (0.2^2 + 0.2^2)/2 ("penalize differences per class")
Cross-entropy = -log(0.8) ("strongly penalize non-1-ness")

Stochastic Gradient Descent: update weights and biases via the gradient of the error (computed with back-propagation):
w <- w - rate * ∂E/∂w
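The two objectives from this slide, evaluated on the example prediction in a few lines of R (the learning rate in the SGD step is an arbitrary placeholder):

```r
predicted <- c(married = 0.8, single = 0.2)
actual    <- c(married = 1,   single = 0)

mse <- mean((predicted - actual)^2)   # (0.2^2 + 0.2^2)/2 = 0.04
ce  <- -sum(actual * log(predicted))  # -log(0.8) ~ 0.223

# One SGD step per weight: w <- w - rate * dE/dw
sgd_step <- function(w, grad, rate = 0.005) w - rate * grad
```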

Page 22: H2O Distributed Deep Learning by Arno Candel 071614


Backward Propagation

How do we compute ∂E/∂w_i for the update w_i <- w_i - rate * ∂E/∂w_i?

Naive: for every i, evaluate E twice at (w_1,…,w_i±∆,…,w_N). Slow!

Backprop: compute ∂E/∂w_i via the chain rule, going backwards through
net = sum_i(w_i*x_i) + b, y = activation(net), E = error(y):

∂E/∂w_i = ∂E/∂y * ∂y/∂net * ∂net/∂w_i
        = ∂(error(y))/∂y * ∂(activation(net))/∂net * x_i
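An R sketch for a single tanh neuron with squared error, checking the chain-rule gradient against the slow finite-difference approach described above (the input values are arbitrary):

```r
x <- c(0.3, -1.2, 0.5); w <- c(0.1, 0.2, -0.4); b <- 0.05; target <- 1

E_of <- function(w) {                       # E = error(y), y = activation(net)
  y <- tanh(sum(w * x) + b)
  (y - target)^2
}

# Backprop: dE/dw_i = dE/dy * dy/dnet * dnet/dw_i
net  <- sum(w * x) + b
y    <- tanh(net)
grad <- 2 * (y - target) * (1 - y^2) * x    # d(tanh)/d(net) = 1 - tanh^2

# Naive: two evaluations of E per weight (slow for millions of weights)
delta <- 1e-6
grad_fd <- sapply(seq_along(w), function(i) {
  wp <- w; wp[i] <- wp[i] + delta
  wm <- w; wm[i] <- wm[i] - delta
  (E_of(wp) - E_of(wm)) / (2 * delta)
})
all.equal(grad, grad_fd, tolerance = 1e-5)  # TRUE: same gradient, far cheaper
```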

Page 23: H2O Distributed Deep Learning by Arno Candel 071614


H2O Deep Learning Architecture

Nodes/JVMs communicate synchronously via the H2O atomic in-memory K-V store (one HTTPD per node); worker threads within a node run asynchronously.

Initial model: weights and biases w.
Map: each node trains a copy of the weights and biases with (some* or all of) its local data, using asynchronous Fork/Join threads.
Reduce: model averaging: average the weights and biases from all nodes, e.g. w* = (w1+w2+w3+w4)/4.
The speedup is at least #nodes/log(#rows) (arXiv:1209.4129v3).

Keep iterating over the data ("epochs"), score from time to time, and query & display the model via JSON or the web UI.

*The user can specify the number of total rows per MapReduce iteration.
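A toy R sketch of this map/reduce model-averaging step (not H2O's Java internals; a linear model stands in for the network, and all "nodes" live in one R session):

```r
set.seed(1)
w0 <- c(0, 0, 0)
node_data <- lapply(1:4, function(i) {      # 4 "nodes", each with 10 local rows
  X <- matrix(rnorm(30), 10, 3)
  list(X = X, y = as.vector(X %*% c(1, -2, 0.5)) + rnorm(10, sd = 0.1))
})

local_train <- function(w, X, y, rate = 0.1) {
  resid <- as.vector(X %*% w) - y           # stand-in for one training pass
  w - rate * as.vector(t(X) %*% resid) / nrow(X)
}

# map: each node trains its own copy of the model on its local data
node_weights <- lapply(node_data, function(d) local_train(w0, d$X, d$y))

# reduce: model averaging, w* = (w1 + w2 + w3 + w4)/4
w_star <- Reduce(`+`, node_weights) / length(node_weights)
```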

Page 24: H2O Distributed Deep Learning by Arno Candel 071614


"Secret" Sauce to Higher Accuracy

Adaptive learning rate (ADADELTA, Google): automatically sets the learning rate for each neuron based on its training history.

Grid search and checkpointing: run a grid search to scan many hyper-parameters, then continue training the most promising model(s).

Regularization: L1 penalizes non-zero weights, L2 penalizes large weights, Dropout randomly ignores certain inputs.

Page 25: H2O Distributed Deep Learning by Arno Candel 071614


Detail: Adaptive Learning Rate

Compute the moving average of ∆w_i² at time t for window length rho:
E[∆w_i²]_t = rho * E[∆w_i²]_t-1 + (1-rho) * ∆w_i²

Compute the RMS of ∆w_i at time t with smoothing epsilon:
RMS[∆w_i]_t = sqrt( E[∆w_i²]_t + epsilon )

Do the same for ∂E/∂w_i, then obtain the per-weight learning rate (cf. the ADADELTA paper):
rate(w_i, t) = RMS[∆w_i]_t-1 / RMS[∂E/∂w_i]_t

Adaptive annealing / progress: gradient-dependent learning rate; the moving window prevents "freezing" (unlike ADAGRAD, which has no window).
Adaptive acceleration / momentum: accumulates previous weight updates, but over a window of time.
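A per-weight R sketch of these formulas (cf. the ADADELTA paper; the rho and epsilon defaults are illustrative):

```r
# One ADADELTA step: no global learning rate, only rho and epsilon.
adadelta_step <- function(state, grad, rho = 0.95, epsilon = 1e-6) {
  state$Eg2 <- rho * state$Eg2 + (1 - rho) * grad^2     # E[g^2]_t
  rate <- sqrt(state$Edw2 + epsilon) / sqrt(state$Eg2 + epsilon)
  dw <- -rate * grad                                    # RMS[dw]_{t-1}/RMS[g]_t
  state$Edw2 <- rho * state$Edw2 + (1 - rho) * dw^2     # E[dw^2]_t
  state$w <- state$w + dw
  state
}

state <- list(w = 0.5, Eg2 = 0, Edw2 = 0)
state <- adadelta_step(state, grad = 0.1)  # the rate adapts per weight
```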

Page 26: H2O Distributed Deep Learning by Arno Candel 071614


Detail: Dropout Regularization

Training: for each hidden neuron, for each training sample, for each iteration, ignore (zero out) a different random fraction p of input activations.

[Diagram: the example network with several input activations crossed out]

Testing: use all activations, but reduce them by a factor p (to "simulate" the missing activations during training).

cf. Geoff Hinton's paper
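A minimal R sketch of dropout; following Hinton's formulation, p below is the probability of keeping an activation:

```r
dropout_train <- function(a, p = 0.5) {
  mask <- rbinom(length(a), 1, p)  # independently zero out units with prob 1-p
  a * mask
}
dropout_test <- function(a, p = 0.5) {
  a * p  # use all activations, scaled to match the training-time expectation
}

h <- c(0.7, -0.3, 1.2, 0.1)   # some hidden activations
dropout_train(h)               # different random units zeroed on every call
dropout_test(h)                # all units, reduced by the factor p
```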

Page 27: H2O Distributed Deep Learning by Arno Candel 071614


MNIST: digits classification

Standing world record: without distortions or convolutions, the best-ever published test set error rate is 0.83% (Microsoft)


Time to check in on the demo!

Let’s see how H2O did in the past 20 minutes!

Page 28: H2O Distributed Deep Learning by Arno Candel 071614


H2O Deep Learning on MNIST: 0.87% test set error (so far)

Test set error: 1.5% after 10 mins, 1.0% after 1.5 hours, 0.87% after 4 hours
Running on 4 nodes with 16 cores each

World-class results: no pre-training, no distortions, no convolutions, no unsupervised training!

Frequent errors: confusing 2/7 and 4/9

Page 29: H2O Distributed Deep Learning by Arno Candel 071614


Weather Dataset

Predict “RainTomorrow” from Temperature, Humidity, Wind, Pressure, etc.

Page 30: H2O Distributed Deep Learning by Arno Candel 071614


Live Demo: Weather Prediction

Interactive ROC curve with real-time updates


3 hidden Rectifier layers, Dropout, L1 penalty, 5-fold cross-validation

The 12.7% 5-fold cross-validation error is at least as good as GBM/RF/GLM models
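A hedged sketch of this model in R (the parameter names exist in the current h2o package; the data file, layer sizes, and L1 value are assumptions):

```r
weather <- h2o.importFile("weather.csv")              # assumed path
predictors <- setdiff(names(weather), "RainTomorrow")

model <- h2o.deeplearning(x = predictors, y = "RainTomorrow",
                          training_frame = weather,
                          activation = "RectifierWithDropout",  # Rectifier + Dropout
                          hidden = c(32, 32, 32),               # assumed sizes
                          hidden_dropout_ratios = c(0.5, 0.5, 0.5),
                          l1 = 1e-5,                            # L1 penalty
                          nfolds = 5)                           # 5-fold cross-validation
```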

Page 31: H2O Distributed Deep Learning by Arno Candel 071614


Live Demo: Grid Search

How did I find those parameters? Grid search! (It works for multiple hyper-parameters at once.)


Then continue training the best model
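With the current h2o R package this can be scripted via h2o.grid (the 2014 demo used the browser GUI; the hyper-parameter values below are illustrative assumptions):

```r
grid <- h2o.grid("deeplearning",
                 x = predictors, y = "RainTomorrow",   # as in the sketch above
                 training_frame = weather,
                 hyper_params = list(hidden = list(c(16, 16), c(32, 32, 32)),
                                     l1     = c(0, 1e-5),
                                     epochs = c(1, 10)))

best_id <- h2o.getGrid(grid@grid_id, sort_by = "logloss")@model_ids[[1]]
best    <- h2o.getModel(best_id)

# then continue training the best model from its checkpoint
# (the network structure must stay the same):
more <- h2o.deeplearning(x = predictors, y = "RainTomorrow",
                         training_frame = weather,
                         checkpoint = best_id,
                         hidden = best@parameters$hidden,
                         epochs = 20)
```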

Page 32: H2O Distributed Deep Learning by Arno Candel 071614


Use Case: Text Classification

Goal: Predict the item from seller’s text description


Train: 578,361 rows, 8,647 cols, 467 classes
Test: 64,263 rows, 8,647 cols, 143 classes

“Vintage 18KT gold Rolex 2 Tone in great condition”

Data: binary word vector, e.g. 0,0,1,0,0,0,0,0,1,0,0,0,1,…,0, where the 1s mark the words present ("vintage", "gold", "condition")

Let’s see how H2O does on the eBay dataset!

Page 33: H2O Distributed Deep Learning by Arno Candel 071614


Out-Of-The-Box: 11.6% test set error after 10 epochs! Predicts the correct class (out of 143) 88.4% of the time!


Train: 578,361 rows, 8,647 cols, 467 classes
Test: 64,263 rows, 8,647 cols, 143 classes

Note 1: The H2O columnar-compressed in-memory store needs only 60 MB to store 5 billion values (a dense CSV needs 18 GB)
Note 2: No tuning was done (results are for illustration only)

Use Case: Text Classification

Page 34: H2O Distributed Deep Learning by Arno Candel 071614


Parallel Scalability (for 64 epochs on MNIST, with “0.87%” parameters)


[Charts: speedup and training time (in minutes) vs. number of H2O nodes (1, 2, 4, 8, 16, 32, 63; 4 cores per node, 1 epoch per node per MapReduce); "2.7 mins" is marked on the training-time plot.]

Page 35: H2O Distributed Deep Learning by Arno Candel 071614


Tips for H2O Deep Learning

General:
- More layers for more complex functions (exponentially more non-linearity).
- More neurons per layer to detect finer structure in the data ("memorizing").
- Add some regularization for less overfitting (smaller validation error).
- Do a grid search to get a feel for convergence, then continue training.
- Try Tanh first, then Rectifier; try max_w2 = 50 and/or L1 = 1e-5.
- Try Dropout (input: 20%, hidden: 50%) with a test/validation set after finding good parameters for convergence on the training set.

Distributed:
- More training samples per iteration: faster, but less accuracy?
- With ADADELTA: try epsilon = 1e-4, 1e-6, 1e-8, 1e-10 and rho = 0.9, 0.95, 0.99.
- Without ADADELTA: try rate = 1e-4…1e-2, rate_annealing = 1e-5…1e-8, momentum_start = 0.5, momentum_stable = 0.99, momentum_ramp = 1/rate_annealing.
- Try balance_classes = true for imbalanced classes.
- Use force_load_balance and replicate_training_data for small datasets.
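A hedged R call wiring up several of the tips above (these parameter names exist in the h2o package; the values are the slide's suggestions, not a tuned configuration, and the frames/columns are assumptions):

```r
model <- h2o.deeplearning(x = predictors, y = response,
                          training_frame = train, validation_frame = valid,
                          activation = "TanhWithDropout",       # try Tanh first
                          input_dropout_ratio   = 0.2,          # input: 20%
                          hidden_dropout_ratios = c(0.5, 0.5),  # hidden: 50%
                          hidden = c(128, 128),                 # assumed sizes
                          max_w2 = 50, l1 = 1e-5,
                          adaptive_rate = TRUE, rho = 0.95, epsilon = 1e-8,
                          balance_classes = TRUE,               # imbalanced classes
                          force_load_balance = TRUE,            # small datasets
                          replicate_training_data = TRUE)
```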


Page 36: H2O Distributed Deep Learning by Arno Candel 071614


H2O brings Deep Learning to R

All parameters are available from R… and more docs coming soon! (draft shown)

Page 37: H2O Distributed Deep Learning by Arno Candel 071614


POJO Model Export for Production Scoring


Plain old Java code is auto-generated to take your H2O Deep Learning models into production!
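From R, the generated Java can be fetched with h2o.download_pojo (the path is an assumption):

```r
# Writes <model_id>.java, a standalone scoring class, to the given directory.
h2o.download_pojo(model, path = "/tmp")
```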

Page 38: H2O Distributed Deep Learning by Arno Candel 071614


Deep Learning Auto-Encoders for Anomaly Detection


Toy example: find an anomaly in ECG heartbeat data. First, train a model on what's "normal": 20 time-series samples of 210 data points each.

Deep Auto-Encoder: learns the low-dimensional non-linear "structure" of the data that allows it to reconstruct the original data.

Also for categorical data!
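A hedged sketch of this flow in R (autoencoder = TRUE and h2o.anomaly are real h2o features; the file paths, layer sizes, and epochs are assumptions):

```r
ecg <- h2o.importFile("ecg_normal.csv")   # 20 "normal" series x 210 points (assumed)
ae  <- h2o.deeplearning(x = 1:210, training_frame = ecg,
                        autoencoder = TRUE,        # unsupervised reconstruction
                        hidden = c(50, 20, 50),    # low-dimensional bottleneck
                        activation = "Tanh", epochs = 100)

test_ecg <- h2o.importFile("ecg_test.csv")         # includes the anomaly (assumed)
err <- h2o.anomaly(ae, test_ecg)                   # per-row reconstruction MSE
# rows whose error is far above the training errors are flagged as anomalies
```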

Page 39: H2O Distributed Deep Learning by Arno Candel 071614


Deep Learning Auto-Encoders for Anomaly Detection


[Diagram: test set with anomaly + model of what's "normal" => the test set prediction (the reconstruction) looks "normal"; the anomaly is found by its large reconstruction error.]

Page 40: H2O Distributed Deep Learning by Arno Candel 071614


H2O Steam: Scoring Platform


Page 41: H2O Distributed Deep Learning by Arno Candel 071614


H2O Steam: More Coming Soon!


Page 42: H2O Distributed Deep Learning by Arno Candel 071614


Key Take-Aways

H2O is a distributed in-memory data science platform. It was designed for high-performance machine learning applications on big data.

H2O Deep Learning is ready to take your advanced analytics to the next level. Try it on your data!

Join our Community and Meetups!
git clone https://github.com/0xdata/h2o
http://docs.0xdata.com
www.h2o.ai/community
@hexadata
