Integrating high-performance machine learning: H2O and KNIME

Mark Landry (H2O), Christian Dietz (KNIME)

© 2017 KNIME AG. All Rights Reserved.

Oct 08, 2020

Speed

H2O: in-memory machine learning platform designed for speed on distributed systems

Accuracy

High Level Architecture

• Data sources: HDFS, S3, NFS, Local, SQL

• Load Data: Distributed In-Memory, Loss-less Compression

• H2O Compute Engine: Exploratory & Descriptive Analysis; Feature Engineering & Selection; Supervised & Unsupervised Modeling; Model Evaluation & Selection; Predict; Data & Model Storage

• Production Scoring Environment: Model Export: Plain Old Java Object; Data Prep Export: Plain Old Java Object; Your Imagination

Distributed Algorithms

Advantageous Foundation

• Foundation for in-memory distributed algorithm calculation: distributed data frames and columnar compression

• All algorithms are distributed in H2O: GBM, GLM, DRF, Deep Learning and more. Fine-grained map-reduce iterations.

• Only enterprise-grade, open-source distributed algorithms in the market

User Benefits

• “Out-of-box” functionalities for all algorithms (NO MORE SCRIPTING) and uniform interface across all languages: R, Python, Java

• Designed for all sizes of data sets, especially large data

• Highly optimized Java code for model exports

• In-house expertise for all algorithms

Parallel Parse into Distributed Rows

Fine Grain Map Reduce Illustration: Scalable Distributed Histogram Calculation for GBM
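
The distributed histogram calculation named in the caption above can be sketched in a few lines of plain Python. This is a minimal illustration of the map-reduce idea only, not H2O's implementation: the function names (`local_histogram`, `merge_histograms`, `distributed_histogram`) and the in-process chunks standing in for cluster nodes are assumptions made for this example. It counts rows per bin, whereas a GBM would typically accumulate gradient statistics per bin.

```python
from functools import reduce


def local_histogram(rows, lo, hi, n_bins):
    """Map step: one worker bins only the rows it holds."""
    counts = [0] * n_bins
    width = (hi - lo) / n_bins
    for x in rows:
        # Clamp so that x == hi falls into the last bin.
        idx = min(int((x - lo) / width), n_bins - 1)
        counts[idx] += 1
    return counts


def merge_histograms(a, b):
    """Reduce step: histograms with identical bin edges add element-wise."""
    return [x + y for x, y in zip(a, b)]


def distributed_histogram(chunks, lo, hi, n_bins):
    # In a real cluster each chunk would be processed on the node that holds it;
    # here the map step simply runs in a loop (assumption for illustration).
    partials = [local_histogram(chunk, lo, hi, n_bins) for chunk in chunks]
    return reduce(merge_histograms, partials, [0] * n_bins)


# Three hypothetical chunks of one numeric column, as if stored on three nodes.
chunks = [[0.1, 0.4, 0.9], [0.2, 0.6], [0.5, 1.0, 0.3]]
hist = distributed_histogram(chunks, lo=0.0, hi=1.0, n_bins=4)
```

Because the merge is an element-wise sum, partial histograms can be combined in any order, which is what makes the calculation easy to spread across nodes: only the small per-bin arrays travel over the network, never the rows themselves.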

Foundation for Distributed Algorithms

Scientific Advisory Council

Machine Learning Benchmarks (https://github.com/szilard/benchm-ml)

Gradient Boosting Machine Benchmark (also available for GLM and Random Forest)

[Figure: two panels, Time (s) and AUC, each plotted against X = millions of samples]

H2O Algorithms

Supervised Learning

Statistical Analysis

• Generalized Linear Models: Binomial, Gaussian, Gamma, Poisson and Tweedie

• Naïve Bayes

Ensembles

• Distributed Random Forest: Classification or regression models

• Gradient Boosting Machine: Produces an ensemble of decision trees with increasingly refined approximations

Deep Neural Networks

• Deep Learning: Creates multi-layer feed-forward neural networks starting with an input layer followed by multiple layers of nonlinear transformations

Unsupervised Learning

Clustering

• K-means: Partitions observations into k clusters/groups of the same spatial size. Automatically detects optimal k

Dimensionality Reduction

• Principal Component Analysis: Linearly transforms correlated variables into uncorrelated components

• Generalized Low Rank Models: Extend the idea of PCA to handle arbitrary data consisting of numerical, Boolean, categorical, and missing data

Anomaly Detection

• Autoencoders: Find outliers using nonlinear dimensionality reduction based on deep learning
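
To make the Principal Component Analysis entry in the algorithm list concrete, here is a minimal sketch for the two-variable case, written with the Python standard library only. It is not H2O's implementation: the helper names (`covariance`, `pca_2d`) and the sample data are assumptions for illustration. For two variables, the principal axes are obtained by rotating the centered data through the angle that zeroes the off-diagonal covariance term.

```python
import math


def covariance(xs, ys):
    """Sample covariance of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)


def pca_2d(xs, ys):
    """Rotate centered 2-D data onto its principal axes.

    Returns the scores on the first and second principal component.
    """
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cxx, cyy, cxy = covariance(xs, xs), covariance(ys, ys), covariance(xs, ys)
    # Rotation angle that diagonalises the 2x2 covariance matrix.
    theta = 0.5 * math.atan2(2 * cxy, cxx - cyy)
    c, s = math.cos(theta), math.sin(theta)
    pc1 = [c * (x - mx) + s * (y - my) for x, y in zip(xs, ys)]
    pc2 = [-s * (x - mx) + c * (y - my) for x, y in zip(xs, ys)]
    return pc1, pc2


# Two strongly correlated hypothetical variables.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
pc1, pc2 = pca_2d(xs, ys)
```

After the rotation the two score columns have zero sample covariance, and the first column carries the larger share of the total variance, which is unchanged by the transformation.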

H2O in KNIME

Live Demo

H2O in KNIME

• Offer our users high-performance machine learning algorithms from H2O in KNIME

• Allow users to mix & match H2O with other KNIME functionality

– Data wrangling functionality of KNIME Analytics Platform

– KNIME Big-Data Connectors

– Text Mining, Image Processing, Cheminformatics, …

– and more!

H2O in KNIME

Live Demo

H2O in KNIME – Cross Validation

H2O in KNIME – Cross Validation

H2O in KNIME – Cross Validation
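
The cross-validation slides showed KNIME workflows whose screenshots are not preserved in this transcript. As a stand-in, the following plain-Python sketch shows the general k-fold idea those workflows rely on. Everything in it is an assumption for illustration: the helper names (`k_fold_indices`, `cross_validate`), the mean-only baseline used as the "model", and the toy data. It does not reproduce the KNIME nodes or H2O's own cross-validation.

```python
def k_fold_indices(n_rows, k):
    """Yield (train_idx, test_idx) pairs; each row is held out exactly once."""
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_rows // k + (1 if i < n_rows % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_rows))
        yield train, test
        start += size


def cross_validate(ys, k):
    """Score a mean-only baseline with k-fold CV; returns the per-fold MSE."""
    scores = []
    for train, test in k_fold_indices(len(ys), k):
        mean = sum(ys[i] for i in train) / len(train)      # "fit" on training rows
        mse = sum((ys[i] - mean) ** 2 for i in test) / len(test)  # score held-out rows
        scores.append(mse)
    return scores


ys = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]  # hypothetical target column
scores = cross_validate(ys, k=3)
```

The folds here are contiguous blocks; in practice rows are usually shuffled (or stratified by class) before splitting so that each fold resembles the full data set.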

H2O in KNIME – Parameter Optimization

H2O in KNIME – Parameter Optimization
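
The parameter optimization slides were likewise screenshots of KNIME workflows. A minimal sketch of the underlying idea, an exhaustive grid search, is given below in plain Python. The names are hypothetical: `grid_search` and `toy_score` are invented for this example, the parameter names `max_depth` and `learn_rate` are placeholders, and `toy_score` merely stands in for "train a model with these settings and return a validation metric".

```python
from itertools import product


def grid_search(param_grid, score_fn):
    """Try every combination in param_grid and keep the best-scoring one."""
    names = sorted(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(params)  # higher is better
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_score(params):
    # Placeholder objective (assumption): peaks at max_depth=5, learn_rate=0.1.
    return -((params["max_depth"] - 5) ** 2) - 100 * (params["learn_rate"] - 0.1) ** 2


grid = {"max_depth": [3, 5, 7], "learn_rate": [0.01, 0.1, 0.3]}
best_params, best_score = grid_search(grid, toy_score)
```

A grid search evaluates every combination, so its cost grows multiplicatively with each added parameter; scoring each candidate with cross-validation rather than a single split gives a steadier comparison at a correspondingly higher cost.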

H2O in KNIME – Nodes in KNIME 3.4

H2O in KNIME – What’s cooking?

H2O in KNIME – What’s cooking?

Thank you!