Using Distributed Computing for MLaaS
Michael Salvador Svanholm, Consultant
We have used Apache to distribute our machine learning tools.
So far, we have created: Anomaly Detection and Classification.
Distributed computing is a way to deliver results quickly when facing a growing amount of data.
Ideally, clients can use these tools without help if they “know” their own data.
Anomaly Detection using K-means clustering can be used to clean data.
On the other hand, anomalies can also be “data of interest”, which means that a lot of value can potentially come from examining them.
These data points are anomalies/outliers
Detecting anomalies in the Danish Business Registry Data (CVR-data)
We found that some companies are anomalies compared to others on a subset of features in the CVR-data from the Danish Business Authority.
[Figure: prototypes that define this cluster, and outliers in this particular cluster]
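The idea of K-means-based anomaly detection can be sketched as follows: cluster the data, then flag the points that lie unusually far from their own cluster prototype. This is only an illustrative sketch on synthetic data, not the distributed implementation from the talk. The function names, the 99% distance quantile used as a cut-off, and the planted outliers are all assumptions made for the example.

```python
import numpy as np

def kmeans(X, k, n_init=5, n_iter=100, seed=0):
    """Lloyd's algorithm with a few random restarts; returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    best = None
    for _ in range(n_init):
        centroids = X[rng.choice(len(X), size=k, replace=False)]
        for _ in range(n_iter):
            # Assign every point to its nearest centroid.
            dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
            labels = dists.argmin(axis=1)
            # Move each centroid to the mean of its points (unchanged if empty).
            updated = np.array([
                X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
                for j in range(k)
            ])
            if np.allclose(updated, centroids):
                break
            centroids = updated
        # Recompute assignments for the final centroids and score this restart.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        inertia = (dists.min(axis=1) ** 2).sum()
        if best is None or inertia < best[0]:
            best = (inertia, centroids, labels)
    return best[1], best[2]

def flag_outliers(X, centroids, labels, quantile=0.99):
    """Flag points whose distance to their own centroid exceeds a quantile (assumed rule)."""
    distance = np.linalg.norm(X - centroids[labels], axis=1)
    return distance > np.quantile(distance, quantile), distance

# Synthetic stand-in data: two dense groups plus three planted outliers.
rng = np.random.default_rng(42)
blob_a = rng.normal(loc=[0.0, 0.0], scale=0.5, size=(200, 2))
blob_b = rng.normal(loc=[5.0, 5.0], scale=0.5, size=(200, 2))
planted = np.array([[10.0, -5.0], [-6.0, 8.0], [12.0, 12.0]])
X = np.vstack([blob_a, blob_b, planted])

centroids, labels = kmeans(X, k=2)
is_outlier, distance = flag_outliers(X, centroids, labels)
print("flagged rows:", np.flatnonzero(is_outlier))
```

Whether the flagged rows are then dropped (data cleaning) or inspected (data of interest) is a separate decision; the detection step is the same in both cases.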
Bankruptcy prediction using classification on the Danish Business Registry Data (CVR-data)
Our analysis shows that the latest number of “årsværk” (full-time equivalent work years) and the number of “closed production units” are significant indicators that a company will not go bankrupt.
Conversely, the number of “open production units” and the second-latest number of “årsværk” are significant indicators of a company that has gone bankrupt.
What’s next?
Semi-supervised learning: We can use a few labeled points together with unlabeled data.
Black/White data points: Labeled data.
Grey data points: Unlabeled data.
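One common way to combine a few labeled points with many unlabeled ones is label propagation on a similarity graph. The sketch below is one possible instance of semi-supervised learning, not necessarily the method planned in the talk. The RBF similarity, the `sigma` value, and the synthetic two-group data are assumptions for illustration.

```python
import numpy as np

def propagate_labels(X, y, n_classes, sigma=1.0, n_iter=200):
    """Spread labels over an RBF similarity graph.

    y holds a class index for labeled points and -1 for unlabeled ones.
    """
    sq_dist = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    W = np.exp(-sq_dist / (2.0 * sigma ** 2))
    np.fill_diagonal(W, 0.0)
    P = W / W.sum(axis=1, keepdims=True)     # row-normalised transition matrix

    labeled_idx = np.flatnonzero(y >= 0)
    F = np.zeros((len(X), n_classes))
    F[labeled_idx, y[labeled_idx]] = 1.0     # one-hot rows for the known labels
    clamp = F[labeled_idx].copy()

    for _ in range(n_iter):
        F = P @ F                            # each point averages its neighbours
        F[labeled_idx] = clamp               # known labels stay fixed
    return F.argmax(axis=1)

# Synthetic stand-in: two groups, with a single labeled point in each
# (the "black/white" points); every other point is "grey" (-1).
rng = np.random.default_rng(1)
X = np.vstack([
    rng.normal(loc=[0.0, 0.0], scale=0.6, size=(100, 2)),
    rng.normal(loc=[4.0, 4.0], scale=0.6, size=(100, 2)),
])
y_true = np.repeat([0, 1], 100)
y = np.full(200, -1)
y[0], y[100] = 0, 1

predicted = propagate_labels(X, y, n_classes=2)
accuracy = (predicted == y_true).mean()
print("accuracy with two labeled points:", accuracy)
```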
Created by: Techerin
Thank you for your attention
Big Data in the Food Supply Chain
Methods for handling missing data
Niels Bruun Ipsen
29/03/2017, Methods for missing data, DTU Compute, Technical University of Denmark
Setting
• Increased use of Big Data methods within the Food Supply Chain[1][2]
• Reasons for missing data: corrupted, expensive, unknown
• Missing data limits performance [3]
• How to handle missing data in a formal way in a Big Data context?
Missing data methods
• PPCA (Probabilistic PCA)
• FA (Factor Analysis)
• Mixtures of PPCA or FA
• ARD (Automatic Relevance Determination)
Missing data process simulation
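To illustrate why a low-dimensional subspace helps with missing values, here is a minimal sketch of iterative low-rank imputation: fit a PCA subspace, refill the missing entries from it, and repeat. Note that this is a simpler, non-probabilistic relative of PPCA and not the project's method; it gives no posterior distribution over the missing values. The function name, the rank, and the "missing completely at random" simulation are assumptions for the example.

```python
import numpy as np

def impute_low_rank(X, rank, n_iter=100, tol=1e-6):
    """Fill NaNs by alternating a rank-`rank` PCA fit with re-imputation."""
    missing = np.isnan(X)
    filled = np.where(missing, np.nanmean(X, axis=0), X)   # start from column means
    for _ in range(n_iter):
        mu = filled.mean(axis=0)
        U, s, Vt = np.linalg.svd(filled - mu, full_matrices=False)
        approx = (U[:, :rank] * s[:rank]) @ Vt[:rank] + mu  # rank-limited reconstruction
        updated = np.where(missing, approx, X)              # observed entries are never changed
        change = np.abs(updated - filled).max()
        filled = updated
        if change < tol:
            break
    return filled

# Simulated missing-data process (assumed: entries missing completely at random).
rng = np.random.default_rng(0)
Z = rng.normal(size=(200, 2))                 # two latent factors
W = rng.normal(size=(2, 6))                   # loadings onto six observed variables
X_full = Z @ W + 0.05 * rng.normal(size=(200, 6))
mask = rng.random(X_full.shape) < 0.15        # about 15% of entries removed
X_obs = X_full.copy()
X_obs[mask] = np.nan

X_hat = impute_low_rank(X_obs, rank=2)
mean_filled = np.where(mask, np.nanmean(X_obs, axis=0), X_obs)
rmse_low_rank = np.sqrt(np.mean((X_hat[mask] - X_full[mask]) ** 2))
rmse_mean = np.sqrt(np.mean((mean_filled[mask] - X_full[mask]) ** 2))
print("RMSE low-rank:", rmse_low_rank, "RMSE column mean:", rmse_mean)
```

Simulating the missing-data process on complete data, as done here, is what makes it possible to score an imputation method against the true values.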
Project Outline
• Probabilistic PCA
– Subspace estimation
– Posterior probability distribution
– Robustness
• Generalization
– Factor Analysis, mixtures
• Automation
– Automatic Relevance Determination, MLaaS
• Process estimation
Thank you
[1] Lokers, Rob, et al. "Analysis of Big Data technologies for use in agro-environmental science."
[2] Marvin, Hans JP, et al. "A holistic approach to food safety risks: Food fraud as an example."
[3] Anagnostopoulos, Christos, and Peter Triantafillou. "Scaling out big data missing value imputations: Pythia vs. Godzilla."
Integrating Big Data in Food
Philip Johan Havemann Jørgensen, PhD student
Measurements for mass spectrum × retention time
Measurements for mass spectrum × retention time × samples
Tensor Factorization (Parafac2):
$X_k = A D_k F_k^{T}$
Key challenge: Determining the correct number of components (trying to use a probabilistic formulation to solve it)
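The structure of the Parafac2 model can be made concrete by generating data from it. The sketch below builds slabs X_k = A D_k F_k^T and checks the standard Parafac2 constraint that F_k^T F_k is the same for every k. The mapping of modes (A as shared mass-spectrum loadings, F_k as per-sample retention-time profiles), the sizes, and the number of components are assumptions for illustration and are not taken from the slides. No fitting is done here.

```python
import numpy as np

rng = np.random.default_rng(0)
n_components = 3
n_mass = 40                              # rows of A (assumed: mass-spectrum channels)
n_times = [30, 34, 28, 31, 29]           # assumed: retention-time points per sample
n_samples = len(n_times)

A = rng.random((n_mass, n_components))         # loadings shared by all samples
F = rng.random((n_components, n_components))   # shared core of every F_k
D = rng.random((n_samples, n_components))      # row k holds the diagonal of D_k

slabs, F_ks = [], []
for k in range(n_samples):
    # P_k has orthonormal columns, so F_k^T F_k = F^T F for every k.
    P_k, _ = np.linalg.qr(rng.normal(size=(n_times[k], n_components)))
    F_k = P_k @ F
    X_k = A @ np.diag(D[k]) @ F_k.T            # X_k = A D_k F_k^T
    F_ks.append(F_k)
    slabs.append(X_k)

for k, X_k in enumerate(slabs):
    print(k, X_k.shape, np.linalg.matrix_rank(X_k))
```

Each slab has rank at most `n_components`, which is why choosing that number correctly is the key modelling decision.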
• Capturing relations in multimodal data
– Data Fusion
• Improving Predictive Analysis
– Transfer Learning/Domain Adaptation
Thank you!
Knowing Nothing
Jeppe Nørregaard
PhD Student with Lars Kai Hansen as supervisor
- Computers and Semantics in Text
Knowing Nothing, DTU Compute, Technical University of Denmark, 29-03-2017
People interact with computers … and other people
“Where do you want to go on holiday?”
Doesn’t know what it’s selling
Fake News
~3,500 personnel == 3,600 tanks?
Motivations
Imagine a computer that…
• “knew” Wikipedia
• could fact check news
• perhaps a little Turing test?
We are currently working on giving computers their own memory
Exam time!
All knowledge in the universe
All knowledge you need
Differentiable Neural Computers [0]
[Diagram: Write and Read operations on a Memory; annotation: “We don’t need to touch this”]
[0] Graves, Alex, et al. "Hybrid computing using a neural network with dynamic external memory." Nature 538.7626 (2016): 471-476.
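The Write and Read operations on an external memory can be sketched in a few lines. The example below shows content-based addressing (a softmax over cosine similarity), a weighted read, and an erase-then-add write, in the spirit of [0]. It is a simplified illustration: the full Differentiable Neural Computer also uses dynamic memory allocation and temporal links, and a neural network controller produces the keys and vectors. The function names, the memory contents, and the strength value are assumptions for the example.

```python
import numpy as np

def content_weights(memory, key, strength):
    """Softmax over the cosine similarity between `key` and every memory row."""
    eps = 1e-8
    similarity = memory @ key / (
        np.linalg.norm(memory, axis=1) * np.linalg.norm(key) + eps
    )
    scores = strength * similarity
    scores = scores - scores.max()          # numerical stability
    weights = np.exp(scores)
    return weights / weights.sum()

def read(memory, weights):
    """Read vector: weighted sum of the memory rows."""
    return weights @ memory

def write(memory, weights, erase, add):
    """Erase-then-add update: M <- M * (1 - w e^T) + w a^T."""
    return memory * (1.0 - np.outer(weights, erase)) + np.outer(weights, add)

# Four memory slots of width three; the last slot starts empty.
memory = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [0.0, 0.0, 0.0],
])
value = np.array([1.0, -1.0, 0.5])

# Write `value` into the last slot, fully erasing what was there.
write_weights = np.array([0.0, 0.0, 0.0, 1.0])
memory = write(memory, write_weights, erase=np.ones(3), add=value)

# Later, retrieve it by content: a sharp lookup keyed on the stored vector.
read_weights = content_weights(memory, key=value, strength=50.0)
recalled = read(memory, read_weights)
print(np.round(read_weights, 3), np.round(recalled, 3))
```

Because every step is a smooth function of the weights, gradients can flow through reads and writes, which is what allows the memory access to be learned rather than programmed by hand.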
Thank You
Jeppe Nørregaard
Automating unsupervised learning
DABAI
Frans Zdyb
Data
Insight
Preprocessing
Domain knowledge
Load into memory
Online stream
Cluster
Sanitize input
Vector embedding
Outlier detection
Choose a loss function
Specify labels
Modeling
Formulate priors
Transfer learning
Meta learning
Engineer features
Learn model parameters
Tune hyperparameters
Build an ensemble
Evaluation
Measure model fit
Measure generalization performance
Measure robustness
Measure scalability
Explanation
Visualisations
Case-based explanations
Report generation
Informed decisions
Machine Learning as a Service
Supervised learning finds predictive relations between variables.
There are systems that do this automatically.
Auto-sklearn¹, a wrapper around scikit-learn, uses meta-learning, Bayesian optimization and ensemble building to outperform the state of the art on the ChaLearn AutoML Challenge.
Classification works really well. Regression is coming along nicely.
¹ “Efficient and Robust Automated Machine Learning”, Feurer et al., 2015
Unsupervised learning finds generalizable dependencies between variables.
Automating it is largely unexplored territory.
Hypothesis:
● Generalize to unseen data
● Robust to different training sets
● Detect outliers
● Aid in supervised learning
Bayesian Optimization with Gaussian Processes
We can use Bayesian Optimization to tune unsupervised models
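A minimal sketch of Bayesian optimization with a Gaussian process surrogate is given below, using NumPy only rather than GPyOpt. The `objective` function is a hypothetical stand-in for an expensive score, such as the held-out error of an unsupervised model as a function of a single hyperparameter. The kernel length scale, the noise level, the grid, and the number of iterations are likewise assumptions chosen for illustration.

```python
import math
import numpy as np

def rbf(a, b, length=0.15):
    """Squared-exponential kernel between two 1-D arrays of inputs."""
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / length ** 2)

def gp_posterior(x_obs, y_obs, x_new, noise=1e-4):
    """Posterior mean and standard deviation of a zero-mean, unit-variance GP."""
    K = rbf(x_obs, x_obs) + noise * np.eye(len(x_obs))
    K_s = rbf(x_obs, x_new)
    mu = K_s.T @ np.linalg.solve(K, y_obs)
    var = 1.0 - np.sum(K_s * np.linalg.solve(K, K_s), axis=0)
    return mu, np.sqrt(np.clip(var, 1e-12, None))

def expected_improvement(mu, sigma, best):
    """Expected improvement for minimisation."""
    improvement = best - mu
    z = improvement / sigma
    cdf = 0.5 * (1.0 + np.vectorize(math.erf)(z / math.sqrt(2.0)))
    pdf = np.exp(-0.5 * z ** 2) / math.sqrt(2.0 * math.pi)
    return improvement * cdf + sigma * pdf

def objective(x):
    """Hypothetical stand-in for an expensive model-evaluation score (lower is better)."""
    return (x - 0.65) ** 2

grid = np.linspace(0.0, 1.0, 201)          # candidate hyperparameter values
x_obs = np.array([0.1, 0.5, 0.9])          # initial evaluations
y_obs = objective(x_obs)

for _ in range(12):
    # Standardise the scores so the unit-variance GP prior is reasonable.
    y_std = (y_obs - y_obs.mean()) / (y_obs.std() + 1e-12)
    mu, sigma = gp_posterior(x_obs, y_std, grid)
    ei = expected_improvement(mu, sigma, y_std.min())
    x_next = grid[np.argmax(ei)]           # most promising candidate
    x_obs = np.append(x_obs, x_next)
    y_obs = np.append(y_obs, objective(x_next))

best_x = x_obs[np.argmin(y_obs)]
print("best hyperparameter found:", best_x)
```

For an unsupervised model the hard part is choosing the score itself; the criteria listed under the hypothesis above (generalization to unseen data, robustness to different training sets) are natural candidates.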
Python + NumPy + SciPy
TensorFlow for distributed numerical computing and automatic differentiation
Edward² for probabilistic modeling, built on top of TensorFlow
● Graphical models
● Neural networks
● Bayesian non-parametrics
● Variational Inference
● MCMC
GPyOpt³ for Bayesian Optimization
● Easy to use
● Parallel
● Up to date
² Edward: A library for probabilistic modeling, inference, and criticism, 2016, edwardlib.org
³ GPyOpt: A Bayesian Optimization framework in python, 2016, sheffieldml.github.io/GPyOpt/
Thank you!