
Continuous Machine and Deep Learning at Scale With Apache Ignite · 2020-03-11 · Mainframe Ignite Persistence NoSQL Hadoop Persistent Layer RDBMS Machine and Deep Learning Messaging

Jun 04, 2020


Continuous Machine and Deep Learning at Scale With Apache Ignite

Ken Cottrell

Solution Architect

[email protected]


2019 © GridGain Systems

Agenda


▪ Continuous Machine Learning / Deep Learning Introduction

▪ Overview of Apache Ignite Continuous ML/DL Capabilities

▪ Demo & API discussion

▪ Q & A


Why Machine Learning at Scale?


Scalability
• Data exceeds the capacity of a single server
• Burden for development and business

Models trained & then deployed in different systems
• Move data out for training
• Wait for training to complete
• Redeploy models in production


Machine Learning Pipelines: where is the time spent?


Continuous Machine Learning at Scale

Before
• App
• Storing and processing working set
• Periodic ETL of terabytes of data
• Loading data for training
• Model training & testing
• Periodic update of models

After (With CL)
• App
• ML/DL Engine
• Storing and processing working set
• Model training & testing
• No ETL
• Instant updates of models


Apache Ignite Is a Top 5 Apache Project

[Chart: Monthly Ignite/GridGain Downloads, Apr-14 to Dec-18, y-axis 0 to 200,000. Est. 15M today, Apache site and Docker site]

[Ranked lists: Top 5 Dev Mailing Lists; Top 5 User Mailing Lists]

From January 1, 2019 Apache Software Foundation Blog Post: “Apache in 2018 – By The Digits”


Apache Ignite Users

• Financial Services
• FinTech
• Software/Cloud
• Telecom & Mobile
• IoT
• AdTech / Media / Entertainment
• Logistics & Transportation
• eCommerce & Retail
• Pharma & Healthcare


Apache Ignite In-Memory Computing Platform

Application Layer
• Web, SaaS, Social, Mobile, IoT

Apache Ignite Features
• Machine and Deep Learning
• Key-Value, SQL, Transactions, Messaging, Streaming, Events
• Compute Grid, Service Grid
• In-Memory Data Store

GridGain Enterprise Features
• Rolling Upgrades
• Security & Auditing
• Monitoring & Management
• Segmentation Protection
• Data Center Replication
• Network Backups
• Full, Incremental, Continuous Backups
• Point-in-Time Recovery
• Heterogeneous Recovery

Persistent Layer
• Ignite Persistence, RDBMS, NoSQL, Hadoop, Mainframe


Overview of Apache Ignite Continuous ML/DL Capabilities


Apache Ignite Continuous Learning framework

Layers
• Java, .NET, C++, Python, Binary Protocol (Thin client)
• TensorFlow, Regressions, K-Means, Decision Trees
• Distributed In-Memory Machine and Deep Learning
• Distributed Machine Learning Datasets
• Compute and Service Grid
• In-Memory Data Store
• Transactional Persistence

Callouts
• Distributed Algorithms
• Large Scale Parallelization
• Multi-language Support
• No ETL
• Distributed Dataset based on partitioned caches


Partitions Distribution and Replication

[Diagram: Node 1 to Node 4 holding partitions 0 to 3 as Primary and Backup copies]

Co-Located by Partition:
• Transactional Data
• Vectorized Data
• Training context data
• Other Computation functions
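
The four-node, four-partition picture above can be sketched roughly as follows. This is an illustration only: the modulo placement and the names `partition_for` and `assign_partitions` are assumptions made for the sketch, and Ignite's real affinity function is more involved.

```python
import zlib

NODES = ["Node 1", "Node 2", "Node 3", "Node 4"]
PARTITIONS = 4


def partition_for(key: str, partitions: int = PARTITIONS) -> int:
    """Map a key to a partition with a stable hash (CRC32, chosen for illustration)."""
    return zlib.crc32(key.encode("utf-8")) % partitions


def assign_partitions(nodes, partitions):
    """Give every partition a primary node and a backup on a different node."""
    layout = {}
    for p in range(partitions):
        primary = nodes[p % len(nodes)]
        backup = nodes[(p + 1) % len(nodes)]  # hypothetical placement rule
        layout[p] = {"primary": primary, "backup": backup}
    return layout


layout = assign_partitions(NODES, PARTITIONS)
part = partition_for("customer-42")
owner = layout[part]  # where this key's data, vectors and context would live
```

Because everything for one key lands in the same partition, the transactional row, its vectorized form and the training context can be processed on the node that owns it.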


Continuous Learning enabled with Partitioned Datasets

[Diagram: the Application maps training to the Ignite nodes and reduces the training results]

Ignite Node: P1 C D
Ignite Node: P2 C D

Redundant (replicated), parallel jobs on each node:
▪ Pre-Process
▪ Vectorize
▪ Train

P = Partition
C = Partition Context
D = Partition Data
D* = Local ETL
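
A minimal sketch of the map/reduce training idea, assuming a one-feature linear regression and plain Python lists standing in for partition data. The helper names (`map_partition`, `reduce_stats`, `solve`) are hypothetical and not part of the Ignite API.

```python
from functools import reduce


def map_partition(rows):
    """Local step: summarize one partition as sufficient statistics."""
    n = len(rows)
    sx = sum(x for x, _ in rows)
    sy = sum(y for _, y in rows)
    sxx = sum(x * x for x, _ in rows)
    sxy = sum(x * y for x, y in rows)
    return (n, sx, sy, sxx, sxy)


def reduce_stats(a, b):
    """Reduce step: statistics from two partitions simply add up."""
    return tuple(u + v for u, v in zip(a, b))


def solve(stats):
    """Closed-form least squares for y = slope * x + intercept."""
    n, sx, sy, sxx, sxy = stats
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return slope, intercept


# Toy data (y = 2x + 1) split across two partitions.
partitions = {
    "P1": [(0.0, 1.0), (1.0, 3.0)],
    "P2": [(2.0, 5.0), (3.0, 7.0)],
}
local_stats = [map_partition(rows) for rows in partitions.values()]
slope, intercept = solve(reduce(reduce_stats, local_stats))
```

Only the five numbers per partition travel to the application; the raw rows stay on their nodes.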


Apache Ignite Distributed Training: Clustering

• K-means (Centroid Mean)

• GMM (Centroid Mean + Variance)

• Use Cases - OLTP and other tabular data that need to be Labeled

– Customer Segmentation

– Anomaly Detection

– Network throughput characterization
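
As a rough illustration of the K-means (centroid mean) idea, here is a small single-process sketch on toy 2-D points. The starting centroids and the data are made up for the example; it does not use the Ignite trainer.

```python
import math


def nearest(point, centroids):
    """Index of the centroid closest to point (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))


def kmeans(points, centroids, iterations=10):
    """Plain Lloyd iterations: assign points, then move centroids to group means."""
    for _ in range(iterations):
        groups = [[] for _ in centroids]
        for p in points:
            groups[nearest(p, centroids)].append(p)
        # An empty group keeps its previous centroid.
        centroids = [
            tuple(sum(c) / len(g) for c in zip(*g)) if g else old
            for g, old in zip(groups, centroids)
        ]
    return centroids


points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (9.0, 9.0), (9.0, 10.0), (10.0, 9.0)]
centroids = kmeans(points, centroids=[(0.0, 0.0), (10.0, 10.0)])
labels = [nearest(p, centroids) for p in points]  # cluster id per row
```

The resulting labels are what turns unlabeled tabular rows into segments.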


Apache Ignite Distributed Training: Classification

• Logistic Regression & Naive Bayes

• SVM, KNN, ANN

• Decision trees & Random Forest

• Use cases - Operational (OLTP) data prediction:

– Fraud detection

– Credit Card Scoring

– Clinical Trials

– Customer Segmentation
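
To make one of the listed classifiers concrete, here is a small KNN sketch on invented "legit" / "fraud" points. The data, labels and the `knn_predict` helper are assumptions for illustration, not the Ignite KNN trainer.

```python
import math
from collections import Counter


def knn_predict(train, query, k=3):
    """Label a query point by majority vote among its k nearest training rows."""
    neighbors = sorted(train, key=lambda row: math.dist(row[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


# Toy labeled rows: (features, label)
train = [
    ((1.0, 1.0), "legit"), ((1.5, 2.0), "legit"), ((2.0, 1.0), "legit"),
    ((8.0, 9.0), "fraud"), ((9.0, 8.0), "fraud"), ((9.0, 9.5), "fraud"),
]

low_risk = knn_predict(train, (1.2, 1.1))
high_risk = knn_predict(train, (8.5, 9.0))
```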


Apache Ignite Distributed Training: Regression

• KNN & Linear Regressions

• Decision tree regression

• Random forest regression

• Gradient-boosted tree regression

• Use cases - Operational data (OLTP) predictions

– Trend analysis

– Financial forecasting

– Time series prediction

– Response modeling (pharma etc)


Apache Ignite: TensorFlow Integration

>>> # TensorFlow 1.x API: tf.contrib and tf.Session are not available in TensorFlow 2.x
>>> import tensorflow as tf
>>> from tensorflow.contrib.ignite import IgniteDataset
>>>
>>> # Read rows straight from the Ignite cache as a TensorFlow dataset
>>> dataset = IgniteDataset(cache_name="SQL_PUBLIC_KITTEN_CACHE")
>>> iterator = dataset.make_one_shot_iterator()
>>> next_obj = iterator.get_next()
>>>
>>> with tf.Session() as sess:
...     for _ in range(3):
...         print(sess.run(next_obj))
...
{'key': 1, 'val': {'NAME': b'WARM KITTY'}}
{'key': 2, 'val': {'NAME': b'SOFT KITTY'}}
{'key': 3, 'val': {'NAME': b'LITTLE BALL OF FUR'}}

Use Cases - Operational “high dimension” data (images, text, audio, speech)
• Image data classification
• Natural Language Processing of clinical notes
• Document classification, free-form text


Apache Ignite Distributed Preprocessing: Normalization
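
The slide itself is a picture, so the following is only a sketch of what normalization could look like, assuming it means rescaling each feature vector to unit p-norm. The `normalize` helper is hypothetical and runs in one process rather than per partition.

```python
def normalize(vector, p=2):
    """Scale a feature vector so that its p-norm equals 1 (zero vectors are left as-is)."""
    norm = sum(abs(v) ** p for v in vector) ** (1.0 / p)
    if norm == 0:
        return list(vector)
    return [v / norm for v in vector]


rows = [[3.0, 4.0], [1.0, 1.0], [0.0, 0.0]]
normalized = [normalize(r) for r in rows]   # L2 norm by default
l1_example = normalize([1.0, 3.0], p=1)     # L1 norm: components sum to 1
```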


Apache Ignite Distributed Preprocessing: Scaling

https://medium.com/@nsethi610/data-cleaning-scale-and-normalize-data-4a7c781dd628
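
A minimal sketch of scaling, assuming the min-max variant that maps every column into the range 0 to 1. The two columns (age, income) and the helper names are invented for the example.

```python
def fit_min_max(rows):
    """Learn the per-column minimum and maximum from the data."""
    columns = list(zip(*rows))
    return [min(c) for c in columns], [max(c) for c in columns]


def min_max_scale(row, mins, maxs):
    """Rescale one row to [0, 1] per column; constant columns map to 0.0."""
    return [
        (v - lo) / (hi - lo) if hi > lo else 0.0
        for v, lo, hi in zip(row, mins, maxs)
    ]


# Toy rows: (age, income)
rows = [[18.0, 20000.0], [40.0, 50000.0], [62.0, 80000.0]]
mins, maxs = fit_min_max(rows)
scaled = [min_max_scale(r, mins, maxs) for r in rows]
```

After scaling, income no longer dominates age simply because its raw numbers are larger.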


Apache Ignite Distributed Preprocessing: One-Hot Encoder

* Also included: String Encoding
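
Both encodings can be sketched in a few lines. The category values below are made-up sample data and the helper names are hypothetical: string encoding maps each category to an integer index, and one-hot encoding turns that index into a 0/1 vector.

```python
def fit_categories(values):
    """String encoding: assign each distinct category an integer index."""
    return {cat: i for i, cat in enumerate(sorted(set(values)))}


def one_hot(value, index):
    """One-hot encoding: a vector with a single 1.0 at the category's index."""
    vec = [0.0] * len(index)
    vec[index[value]] = 1.0
    return vec


embarked = ["S", "C", "S", "Q"]      # toy categorical column
index = fit_categories(embarked)
encoded = [one_hot(v, index) for v in embarked]
```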


Achieving Continuous ML/DL at Scale: Architectural Considerations / Trade-Offs


• Operational Data Models: from centralized to parallelized

– De-normalization and data affinity for parallel loads, queries, updates, joins

– Horizontal scale-out

• Done locally in node: data partition + preprocessing + training + inferencing

– Reduces data shuffling over the network between the cluster and application

• ML pipeline enhancements

– Co-Located & Distributed processing of all ML steps: ingest to inferencing

– ML model performance measured, and updatable, with nearby transaction data


Demo & API discussion


package org.apache.ignite.examples.ml


Adding your own Preprocessor and Algorithm to a Dataset

• dataset/AlgorithmSpecificDatasetExample.java

Passing custom preprocessor classes to the cluster

• environment/TrainingWithCustomPreprocessorsExample.java

TensorFlow dataset, inferencing at the cluster nodes

• inference/TensorFlowDistributedInferenceExample.java

Decision tree

• tree/FraudDetectionExample.java

End-to-End Model Prep & Training Tutorial (shows feature preprocessing, transformation, different algorithm comparisons, accuracy metrics, pipelines)

• tutorial/*.java // pipeline to preprocess, train, & evaluate Titanic passenger data


Ignite ML API to Update the model

// Train an initial model on the first data cache
SVMLinearClassificationTrainer trainer = new SVMLinearClassificationTrainer();
SVMLinearClassificationModel mdl1 = trainer.fit(ignite, dataCache1, vectorizer);

// Update the existing model with new data from a second cache
SVMLinearClassificationModel mdl2 = trainer.update(mdl1, ignite, dataCache2, vectorizer);

DatasetTrainer interface:
(Some constraints, depending on the algorithm)

Online / online-batch updates with new data
• Centroid updates – KMeans, ANN
• Add new dataset – KNN
• Update with new gradient – NN, Logistic Regression, Linear Regression
• Increment to current state – SVM, GDB
• Decision Tree – retrain
• Random Forest – adds new DTs, may discard other DTs for size management
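
The fit-then-update pattern can be sketched in Python with a toy centroid model. `CentroidModel`, `fit` and `update` are hypothetical stand-ins written for this illustration and are not the Ignite `DatasetTrainer` API; the point is that the update folds in a new batch without revisiting the old data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CentroidModel:
    count: int     # number of rows seen so far
    mean: tuple    # running centroid of those rows


def fit(rows):
    """Train from scratch: the centroid is the column-wise mean."""
    n = len(rows)
    return CentroidModel(n, tuple(sum(col) / n for col in zip(*rows)))


def update(model, new_rows):
    """Fold a new batch into the model using only the stored count and mean."""
    batch = fit(new_rows)
    total = model.count + batch.count
    mean = tuple(
        (model.count * m + batch.count * b) / total
        for m, b in zip(model.mean, batch.mean)
    )
    return CentroidModel(total, mean)


batch1 = [(1.0, 2.0), (3.0, 4.0)]   # stands in for dataCache1
batch2 = [(5.0, 6.0), (7.0, 8.0)]   # stands in for dataCache2
mdl1 = fit(batch1)
mdl2 = update(mdl1, batch2)
```

For this simple model the incremental result matches a full retrain on both batches; algorithms such as decision trees do not have that property, which is why they retrain instead.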


Demo tutorial sample


To run this example:
• Import this directory with pom.xml into your favorite IDE as a Maven project
– <path>\apache-ignite-2.7.6-bin\examples\pom.xml
• I’ll run this job on a single node on my laptop in Eclipse (normally you would run jobs on a cluster of nodes)
– Each of these steps can be run independently or all together with TutorialStepByStepExample.java
– Widely used Titanic data set (we include it here)
• Discussion of how the Apache Ignite API can be invoked by 3rd-party Auto ML and other application wrappers
• Compare the accuracy obtained with different ML steps
– Accuracy defined as % correct predictions versus ground truth
– Different algorithms and different preprocessing
– Effects of test / train split on overfitting
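
The accuracy definition and the train / test split can be sketched as below. The rows and predictions are toy values, not Titanic results, and the helper names are assumptions for the example.

```python
import random


def train_test_split(rows, train_fraction=0.75, seed=0):
    """Shuffle with a fixed seed, then cut into a training part and a held-out test part."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def accuracy(predictions, truth):
    """Fraction of predictions that match the ground truth."""
    correct = sum(1 for p, t in zip(predictions, truth) if p == t)
    return correct / len(truth)


rows = list(range(20))                       # stand-in for 20 labeled rows
train, test = train_test_split(rows)
score = accuracy([1, 0, 1, 1], [1, 0, 0, 1])  # 3 of 4 predictions correct
```

Measuring accuracy on the held-out part rather than on the training rows is what exposes overfitting.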


Apache Ignite Spark integration

Write to Ignite DataFrame from within Spark session

Read from same Ignite DataFrame from another Spark Session

• DF (and RDD) shared across sessions
• SQL with Indexing for faster queries
• Ignite DF are mutable


To Summarize: Apache Ignite for Continuous Learning at Scale


Massive Scale for Memory, Storage, Computation
• Massive Throughput with minimal ETL
• Massive operational data sizes + in-place parallel processing
• Faster cycle times from transactions, ML/DL dataset extraction, predictions

Integrates with Existing ML / DL operations
• Low-level Distributed APIs to integrate with Auto ML and other Data Science workflows
• For End-Users: Python API to manage Cache, Datasets, SQL, ML
• Apache Ignite integrations to accelerate Spark, TensorFlow pipelines, including Model imports from other tool sets


Resources


• Documentation:
– https://apacheignite.readme.io/docs
• Python support
– https://github.com/gridgain/ml-python-api
• Examples and Tutorials:
– https://github.com/apache/ignite/tree/master/examples/src/main/java/org/apache/ignite/examples/ml
• Details on TensorFlow
– https://medium.com/tensorflow/tensorflow-on-apache-ignite-99f1fc60efeb


Q & A