Using Distributed Computing for MLaaS Michael Salvador Svanholm, Consultant

Source: dabai.dk/sites/default/files/events/Young scientists_2.pdf

Jul 04, 2020


We have used Apache to distribute our machine learning tools.

So far, we have created: Anomaly Detection and Classification.

Distributed computing is a method to deliver results fast when facing a growing amount of data.

Ideally, clients can use these tools without help, if they “know” their own data.


Anomaly Detection using K-means clustering can be used to clean data

On the other hand, anomalies can also be “data of interest”, which means that a lot of value can potentially come from examining them.

Figure label: These data points are anomalies/outliers
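As a rough illustration of the idea on this slide, the sketch below clusters points with plain K-means and flags points that lie unusually far from their cluster centroid. It is not the speaker's implementation: the function names, the toy data, the evenly spaced initialisation and the "more than `factor` times the mean in-cluster distance" rule are all assumptions chosen to keep the example small and reproducible.

```python
import math


def nearest(point, centroids):
    """Index of the centroid closest to `point` (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda c: math.dist(point, centroids[c]))


def kmeans(points, k, iterations=20):
    """Lloyd's algorithm. Returns (centroids, assignments)."""
    # Assumption for reproducibility: start from k evenly spaced input points
    # instead of a random initialisation.
    centroids = [points[i * len(points) // k] for i in range(k)]
    for _ in range(iterations):
        assignments = [nearest(p, centroids) for p in points]
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # keep the old centroid if a cluster runs empty
                centroids[c] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return centroids, [nearest(p, centroids) for p in points]


def find_anomalies(points, k, factor=2.0):
    """Flag points farther from their centroid than `factor` times the
    mean distance within the same cluster (an illustrative rule)."""
    centroids, assignments = kmeans(points, k)
    distances = [math.dist(p, centroids[a]) for p, a in zip(points, assignments)]
    anomalies = []
    for c in range(k):
        cluster = [d for d, a in zip(distances, assignments) if a == c]
        if not cluster:
            continue
        cutoff = factor * sum(cluster) / len(cluster)
        anomalies += [
            p for p, d, a in zip(points, distances, assignments)
            if a == c and d > cutoff
        ]
    return anomalies


# Synthetic toy data: two tight groups plus one stray point.
points = [
    (0.0, 0.0), (0.5, 0.2), (0.2, 0.6), (-0.3, 0.1), (0.1, -0.4),
    (10.0, 10.0), (10.4, 9.8), (9.7, 10.3), (10.2, 10.5), (9.9, 9.6),
    (10.0, 16.0),
]
anomalies = find_anomalies(points, k=2)  # [(10.0, 16.0)]
```

Whether a flagged point is noise to be cleaned away or "data of interest" is a domain decision; the code only surfaces the candidates.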


Detecting anomalies in the Danish Business Registry Data (CVR-data)

We found that some companies are anomalies, compared to others, on a subset of features in the CVR-data from the Danish Business Authority.

Figure labels: Prototypes that define this cluster; Outliers in this particular cluster


Bankruptcy prediction using classification on the Danish Business Registry Data (CVR-data)

Our analysis shows that the latest number of “årsværk” (full-time equivalents) and the number of “closed production units” are significant with respect to keeping a company from going bankrupt.

On the other hand, the number of “open production units” and the second-latest number of “årsværk” are significant with respect to a company having gone bankrupt.


What’s next?

Semi-supervised learning: We can use a few labeled points with unlabeled data.

Black/White data points: Labeled data.

Grey data points: Unlabeled data.

Created by: Techerin
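One simple way to combine a few labeled points with many unlabeled ones is nearest-neighbour self-training: repeatedly take the unlabeled point that is closest to any already labeled point and give it that neighbour's label. The slide does not say which semi-supervised method is planned, so the sketch below is only an illustration; the function name and the toy points are hypothetical.

```python
import math


def self_train(labeled, unlabeled):
    """Spread labels from a few labeled points to unlabeled ones.

    labeled:   dict mapping point (tuple) -> label
    unlabeled: list of points (tuples)
    Returns a new dict in which every point has a label.
    """
    labeled = dict(labeled)      # do not mutate the caller's data
    remaining = list(unlabeled)
    while remaining:
        # Find the (unlabeled, labeled) pair with the smallest distance.
        _, point, label = min(
            ((math.dist(u, p), u, lab)
             for u in remaining
             for p, lab in labeled.items()),
            key=lambda item: item[0],
        )
        labeled[point] = label   # the newly labeled point can now spread its label
        remaining.remove(point)
    return labeled


# Two labeled points ("black" and "white") and six unlabeled ("grey") points.
seeds = {(0.0, 0.0): "black", (10.0, 0.0): "white"}
grey = [(1.0, 0.0), (2.0, 0.0), (3.0, 0.0), (9.0, 0.0), (8.0, 0.0), (6.0, 0.0)]
result = self_train(seeds, grey)
```

Because each newly labeled point is added to the labeled set, labels travel along dense regions of the data rather than depending only on the distance to the original seeds.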


Thank you for your attention


Big Data in the Food Supply Chain

Methods for handling missing data

Niels Bruun Ipsen


29/03/2017, Methods for missing data, DTU Compute, Technical University of Denmark


Setting

• Increased use of Big Data methods within the Food Supply Chain[1][2]

• Missing data reasons: corrupted, expensive, unknown

• Influence by missing data limits performance [3]

• How to handle missing data in a formal way in a Big Data context?

Missing data methods

• PPCA (probabilistic PCA)

• FA (factor analysis)

• Mixtures of PPCA or FA

• ARD (automatic relevance determination)

• Missing data process simulation


Project Outline

• Probabilistic PCA

– Subspace estimation

– Posterior probability distribution

– Robustness

• Generalization

– Factor Analysis, mixtures

• Automation

– Automatic Relevance Determination, MLaaS

• Process estimation
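For readers unfamiliar with probabilistic PCA, the standard textbook form of the model is summarised below; it shows why the method suits missing data. The notation (latent dimension q, data dimension d, loadings W, mean μ, noise variance σ², observed subvector x_o with the matching rows W_o and μ_o) is an assumption for illustration, not taken from the slides, and the last line presumes the missingness mechanism can be ignored.

```latex
\begin{aligned}
z &\sim \mathcal{N}(0,\ I_q), \qquad
x \mid z \sim \mathcal{N}(W z + \mu,\ \sigma^2 I_d) \\
x &\sim \mathcal{N}(\mu,\ W W^{\top} + \sigma^2 I_d) \\
z \mid x &\sim \mathcal{N}\!\left(M^{-1} W^{\top} (x - \mu),\ \sigma^2 M^{-1}\right),
\qquad M = W^{\top} W + \sigma^2 I_q \\
x_o &\sim \mathcal{N}(\mu_o,\ W_o W_o^{\top} + \sigma^2 I)
\end{aligned}
```

The first two lines give the subspace model, the third is the posterior over the latent coordinates, and the fourth is the marginal of the observed entries only, which lets the likelihood be evaluated without first filling in the missing values.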


Thank you

[1] Lokers, Rob, et al. "Analysis of Big Data technologies for use in agro-environmental science."

[2] Marvin, Hans JP, et al. "A holistic approach to food safety risks: Food fraud as an example."

[3] Anagnostopoulos, Christos, and Peter Triantafillou. "Scaling out big data missing value imputations: Pythia vs. Godzilla."


Integrating Big Data in Food

Philip Johan Havemann Jørgensen, PhD student


Measurements for mass spectrum × retention time


Measurements for mass spectrum × retention time × samples


Tensor Factorization (PARAFAC2):

X_k = A D_k F_k^T

Key challenge: Determining the correct number of components (trying to use a probabilistic formulation to solve it)
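Written out with its usual constraint, the PARAFAC2 model reads as below. The dimensions and the reading of each factor (A as mass-spectral loadings shared across samples, D_k as a diagonal matrix of per-sample weights, F_k as sample-specific retention-time profiles, R as the number of components) are assumptions based on how PARAFAC2 is commonly applied to this kind of data, not statements from the slide.

```latex
\begin{aligned}
X_k &= A \, D_k \, F_k^{\top}, \qquad k = 1, \dots, K \\
F_k^{\top} F_k &= \Phi \quad \text{for all } k
\quad\Longleftrightarrow\quad
F_k = P_k F, \qquad P_k^{\top} P_k = I_R
\end{aligned}
```

The constraint lets the profiles F_k differ from sample to sample, for example through retention-time shifts, while keeping their cross-product fixed. The number of components R is the quantity the slide identifies as hard to determine.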


• Capturing relations in multimodal data
– Data Fusion

• Improving Predictive Analysis
– Transfer Learning/Domain Adaptation


Thank you!


Knowing Nothing

Jeppe Nørregaard

PhD Student with Lars Kai Hansen as supervisor

- Computers and Semantics in Text


People interact with computers

Where do you want to go on holiday?

Doesn’t know what it’s selling

… and other people


Fake News

~3,500 personnel == 3,600 tanks?


Motivations

Imagine a computer that…

• “knew” Wikipedia

• could fact check news

• perhaps a little Turing test?


We are currently working on

Giving computers their own memory


Exam time!


All knowledge in the universe


All knowledge you need


Differentiable Neural Computers [0]

Figure labels: Write; Read; Memory; “We don’t need to touch this”

[0] Graves, Alex, et al. "Hybrid computing using a neural network with dynamic external memory." Nature 538.7626 (2016): 471-476.
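The read and write operations in the figure can be illustrated with content-based addressing, one of the addressing mechanisms described in [0]: a key is compared with every memory row, the similarities become soft weights, and those weights drive both reading and erase-then-add writing. The sketch below is a simplified, non-learned version in plain Python; the function names are hypothetical, and it leaves out the controller network, dynamic allocation and temporal links of the full model.

```python
import math


def cosine(u, v):
    """Cosine similarity with a small constant to avoid division by zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / (norm + 1e-8)


def content_weights(memory, key, strength):
    """Softmax over `strength` times the cosine similarity of each row to `key`."""
    scores = [strength * cosine(row, key) for row in memory]
    top = max(scores)                      # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def read(memory, weights):
    """Read vector: weighted sum of the memory rows."""
    width = len(memory[0])
    return [sum(w * row[j] for w, row in zip(weights, memory)) for j in range(width)]


def write(memory, weights, erase, add):
    """Erase-then-add update of every row, scaled by that row's write weight."""
    return [
        [m * (1 - w * e) + w * a for m, e, a in zip(row, erase, add)]
        for w, row in zip(weights, memory)
    ]


memory = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
weights = content_weights(memory, key=[0.0, 1.0, 0.0], strength=10.0)
recalled = read(memory, weights)           # dominated by the second row
updated = write(memory, [0.0, 1.0, 0.0], erase=[1.0, 1.0, 1.0], add=[5.0, 5.0, 5.0])
```

Every step is a smooth function of its inputs, which is what allows a network to learn how to use the memory by gradient descent.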


Thank You

Jeppe Nørregaard


Automating unsupervised learning

DABAI

Frans Zdyb


Data

Insight

Preprocessing

Domain knowledge

Load into memory

Online stream

Cluster

Sanitize input

Vector embedding

Outlier detection

Choose a loss function

Specify labels

Modeling

Formulate priors

Transfer learning

Meta learning

Engineer features

Learn model parameters

Tune hyperparameters

Build an ensemble

Evaluation

Measure model fit

Measure generalization performance

Measure robustness

Measure scalability

Explanation

Visualisations

Case-based explanations

Report generation

Informed decisions

Machine Learning as a Service


Supervised learning finds predictive relations between variables. There are systems that do this automatically.

Auto-sklearn¹, a wrapper around scikit-learn, uses meta-learning, Bayesian optimization and ensemble building to outperform the state of the art on the ChaLearn AutoML Challenge.

Classification works really well. Regression is coming along nicely.

¹ “Efficient and Robust Automated Machine Learning”, Feurer et al., 2015


Unsupervised learning finds generalizable dependencies between variables. Automating it is largely unexplored territory.

Hypothesis:

● Generalize to unseen data
● Robust to different training sets
● Detect outliers
● Aid in supervised learning

Bayesian Optimization with Gaussian Process

We can use Bayesian Optimization to tune unsupervised models
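To make the tuning loop concrete, here is a minimal one-dimensional Bayesian optimization sketch in plain Python: a Gaussian process with an RBF kernel models the score as a function of one hyperparameter, and an upper-confidence-bound rule picks the next value to try. Everything specific is an assumption for illustration: the kernel length scale, the noise level, the UCB acquisition with `kappa`, the candidate grid, and the stand-in objective, which in practice might be something like a held-out score of the unsupervised model.

```python
import math


def rbf(a, b, length=0.3):
    """Squared-exponential kernel on scalars."""
    return math.exp(-0.5 * ((a - b) / length) ** 2)


def cholesky(A):
    """Lower-triangular L with L L^T = A, for a symmetric positive definite A."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(A[i][i] - s) if i == j else (A[i][j] - s) / L[j][j]
    return L


def solve_lower(L, b):
    """Forward substitution: solve L y = b."""
    y = []
    for i in range(len(b)):
        y.append((b[i] - sum(L[i][k] * y[k] for k in range(i))) / L[i][i])
    return y


def solve_upper_t(L, y):
    """Back substitution: solve L^T x = y."""
    n = len(y)
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (y[i] - sum(L[k][i] * x[k] for k in range(i + 1, n))) / L[i][i]
    return x


def gp_fit(xs, ys, noise=1e-4):
    """Factorise the kernel matrix and precompute K^{-1} y."""
    K = [[rbf(a, b) + (noise if i == j else 0.0) for j, b in enumerate(xs)]
         for i, a in enumerate(xs)]
    L = cholesky(K)
    return L, solve_upper_t(L, solve_lower(L, ys))


def gp_predict(xs, L, alpha, x_new):
    """Posterior mean and variance of a zero-mean GP at x_new."""
    k_star = [rbf(a, x_new) for a in xs]
    mean = sum(k * a for k, a in zip(k_star, alpha))
    v = solve_lower(L, k_star)
    return mean, max(1.0 - sum(t * t for t in v), 0.0)


def bayes_opt(objective, lower, upper, n_init=3, n_iter=10, kappa=2.0, grid=101):
    """Maximise `objective` on [lower, upper] with a GP surrogate and UCB."""
    candidates = [lower + (upper - lower) * i / (grid - 1) for i in range(grid)]
    xs = [lower + (upper - lower) * (i + 0.5) / n_init for i in range(n_init)]
    ys = [objective(x) for x in xs]
    for _ in range(n_iter):
        L, alpha = gp_fit(xs, ys)

        def ucb(x):
            mean, var = gp_predict(xs, L, alpha, x)
            return mean + kappa * math.sqrt(var)

        x_next = max((c for c in candidates if c not in xs), key=ucb)
        xs.append(x_next)
        ys.append(objective(x_next))
    best = max(range(len(xs)), key=lambda i: ys[i])
    return xs[best], ys[best]


def toy_score(h):
    """Hypothetical stand-in for a model score as a function of one hyperparameter."""
    return -(h - 0.7) ** 2


best_h, best_score = bayes_opt(toy_score, 0.0, 1.0)
```

The surrogate is cheap to evaluate, so the expensive step (training and scoring the unsupervised model) runs only once per iteration. Libraries such as GPyOpt package the same loop with more careful kernels and acquisition functions.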


Python + NumPy + SciPy

TensorFlow for distributed numerical computing and automatic differentiation

Edward², for probabilistic modeling, built on top of TensorFlow

● Graphical models
● Neural networks
● Bayesian non-parametrics
● Variational Inference
● MCMC

GPyOpt³, for Bayesian Optimization

● Easy to use
● Parallel
● Up to date

² Edward: A library for probabilistic modeling, inference, and criticism, 2016, edwardlib.org

³ GPyOpt: A Bayesian Optimization framework in python, 2016, sheffieldml.github.io/GPyOpt/


Thank you!