Scalable Machine Learning
A survey of large scale machine learning frameworks.
Arnaud Rachez [email protected]
Jul 28, 2015
Intro - Cellule de Calcul
• Who?
  – Engineers: Arnaud Rachez, Fabian Pedregosa (from Feb. 2015)
  – Researchers: Stéphane Gaiffas (X), Robin Ryder (Dauphine)
• What for?
  – Pooling computational resources for the partners of the chair.
  – Centralizing computational expertise for academic projects and industrial collaborations.
Context
• Viewing Big Data from the perspective of a machine learning researcher.
• Implementing algorithms at scale in a parallel and distributed fashion.
• Big models trained with online optimisation (e.g. deep networks) or sampling (e.g. topic models).
Why all the hype?
Alon Halevy, Peter Norvig, Fernando Pereira. The Unreasonable Effectiveness of Data. IEEE Intelligent Systems, 2009.
[Max Welling, ICML 2014]
Big Data → Big Models
Outline
• Out of core
• Data parallel
• Graph parallel
• Model parallel
More details
Since this is a short talk, I'll go very quickly over a lot of the interesting details of the frameworks. If you are interested in knowing more and have ~2 hours to spare, you should definitely check out J. Gonzalez's talk:
J. Gonzalez @ ICML 2014: techtalks.tv/talks/emerging-systems-for-large-scale-machine-learning/60852/
Out of core
Scaling on one machine
Out of core
• Problem: Training data does not fit in RAM.
• Solution: Lay out data efficiently on disk and load it as needed in memory.
• Very fast online learning. One thread to read, one to train. Hashing trick, online error, etc.
• Sometimes extends to GPU computing: parallel matrix multiplication, where the bottleneck tends to be CPU-GPU memory transfer.
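To give a rough feel for the out-of-core pattern and the hashing trick, here is a minimal Python sketch using scikit-learn (this is not Vowpal Wabbit itself; the file name, line format, chunk size and feature count are assumptions made for illustration):

# Minimal out-of-core learning sketch (illustrative; not Vowpal Wabbit).
# Assumes a hypothetical file "clicks.tsv" with one "label<TAB>raw text features" per line.
from itertools import islice

from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

vectorizer = HashingVectorizer(n_features=2**20)   # hashing trick: fixed-size memory
model = SGDClassifier(loss="log_loss")             # online logistic regression
                                                   # (older scikit-learn spells this loss="log")

def read_chunks(path, chunk_size=100_000):
    """Yield (labels, texts) chunks so that only one chunk is ever held in RAM."""
    with open(path) as f:
        while True:
            lines = list(islice(f, chunk_size))
            if not lines:
                return
            labels = [int(line.split("\t", 1)[0]) for line in lines]
            texts = [line.split("\t", 1)[1] for line in lines]
            yield labels, texts

for labels, texts in read_chunks("clicks.tsv"):
    X = vectorizer.transform(texts)                 # sparse, hashed feature matrix
    model.partial_fit(X, labels, classes=[0, 1])    # one online pass over the chunk

VW additionally dedicates one thread to reading and one to training so that I/O and learning overlap; the sketch above does both in the same thread.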
Playing with Vowpal Wabbit
• Criteo's Display Advertising Challenge dataset: ~10GB with ~50MM lines.
• VW's logistic regression run on one EC2 instance with an attached EBS volume (3000 reserved IOPS):
  – cross-entropy = 0.473 in 2’10” (one online pass)
  – converged to 0.470 in 7 passes (9’4”)
• Pure C++ code. Compiles without problems on Linux, but the latest version has trouble on Mac. Has recently added support for a cluster mode using allreduce.
• Does not seem to support implementing new algorithms easily.
Scalability - A perspective on Big data
• Strong scaling: if you throw twice as many machines at the task, you solve it in half the time. Usually relevant when the task is CPU bound.
• Weak scaling: if the dataset is twice as big, throw twice as many machines at it to solve the task in constant time. Usually relevant when the task is memory bound.
Most “big data” problems are I/O bound, which makes it hard to solve the task in an acceptable time independently of the size of the data (weak scaling).
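As a toy illustration of the two regimes (all timings below are made up):

# Toy definitions of strong and weak scaling (illustrative only).

def strong_scaling_speedup(t_1_machine, t_p_machines):
    """Strong scaling: same dataset on p machines; the ideal speedup is p."""
    return t_1_machine / t_p_machines

def weak_scaling_efficiency(t_1_machine, t_p_machines_p_x_data):
    """Weak scaling: dataset grows with p; the ideal efficiency is 1.0 (constant time)."""
    return t_1_machine / t_p_machines_p_x_data

# Hypothetical timings: 2 machines on the same data, then 2 machines on 2x the data.
print(strong_scaling_speedup(100.0, 55.0))      # ~1.8x, below the ideal 2x
print(weak_scaling_efficiency(100.0, 110.0))    # ~0.91, below the ideal 1.0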
Data parallel
Statistical query model
Map-Reduce: Statistical query model
In the statistical query model the algorithm only needs sums of the form Σ f(x, y) over the dataset:
• f, the map function, is sent to every machine.
• The sum corresponds to a reduce operation.
• D. Caragea et al., A Framework for Learning from Distributed Data Using Sufficient Statistics and Its Application to Learning Decision Trees. Int. J. Hybrid Intell. Syst. 2004
• Chu et al., Map-Reduce for Machine Learning on Multicore. NIPS’06.
Statistical query model - Example
Gradient of the loss:
• For each (x, y) in the dataset, compute its gradient in parallel (Map step).
• Sum the gradients via Reduce.
• Update w on the master.
• Repeat until convergence.
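A single-machine sketch of this pattern follows, assuming a logistic loss (the slide does not specify the loss); in a real cluster each partition's gradient would be computed on its own machine:

# Statistical-query-model gradient descent, sketched on one machine.
# The logistic loss is an assumption; any loss whose gradient is a sum over examples fits.
from functools import reduce
import numpy as np

def partition_gradient(w, X, y):
    """Map step: summed gradient of the logistic loss over one data partition."""
    p = 1.0 / (1.0 + np.exp(-X @ w))    # predicted probabilities
    return X.T @ (p - y)

def train(partitions, dim, lr=0.1, n_iter=100):
    n_total = sum(len(y) for _, y in partitions)
    w = np.zeros(dim)
    for _ in range(n_iter):
        # Map: in a cluster, each partition's gradient is computed where the data lives.
        grads = [partition_gradient(w, X, y) for X, y in partitions]
        # Reduce: sum the partial gradients, then update w on the master.
        w -= lr * reduce(np.add, grads) / n_total
    return w

# Toy data split into 4 "partitions".
rng = np.random.default_rng(0)
X, y = rng.normal(size=(4000, 10)), rng.integers(0, 2, size=4000).astype(float)
w = train([(X[i::4], y[i::4]) for i in range(4)], dim=10)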
Map-Reduce
• Resilient to failure: HDFS disk replication.
• Can run on huge clusters.
• Makes use of data locality: the program (query) is moved to the data and not the opposite.
• Map functions must be stateless: state is lost between map iterations.
• The computation graph is very constrained by independencies: not ideal for computation on arbitrary graphs.
From Hadoop to Spark
Shamelessly stolen from J. Gonzalez's presentation.
Implemented algorithms (MLlib 1.1)
• linear SVM and logistic regression
• classification and regression tree
• k-means clustering
• recommendation via alternating least squares
• singular value decomposition
• linear regression with L1- and L2-regularization
• multinomial naive Bayes
• basic statistics
• feature transformations
Playing with Spark
• Scala library with Java and Python interfaces. The Python version was not always responsive.
• Easy installation on both Linux and Mac. EC2 scripts allow for easy deployment in standalone cluster mode (give instances some additional time to be initialised correctly).
• The code base is under active development and MLlib seems a bit buggy at times. The Spark 1.1 version fixes an OutOfMemory error but crashes at the very end of the job.
• Strong scaling for logistic regression was super-linear… (probably due to a sub-optimal configuration).
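For reference, the kind of MLlib call involved, sketched against the Spark 1.x-era Python API (the input path and line format are placeholders):

# Spark 1.x-style MLlib logistic regression sketch; path and data format are placeholders.
from pyspark import SparkContext
from pyspark.mllib.classification import LogisticRegressionWithSGD
from pyspark.mllib.regression import LabeledPoint

sc = SparkContext(appName="lr-demo")

def parse(line):
    # Assumed format: "label feature1 feature2 ..." separated by spaces.
    values = [float(v) for v in line.split()]
    return LabeledPoint(values[0], values[1:])

data = sc.textFile("hdfs:///path/to/train.txt").map(parse).cache()
model = LogisticRegressionWithSGD.train(data, iterations=100)

errors = data.map(lambda p: int(model.predict(p.features) != p.label)).sum()
print("training error: %f" % (errors / float(data.count())))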
Graph parallel
Vertex programming
Pregel
The Graph-parallel pattern
• Model / algorithm state is placed on the graph.
• Computation depends only on the neighbors.
Shamelessly stolen from J. Gonzalez's presentation, ICML '14.
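To make "computation depends only on the neighbors" concrete, here is a toy synchronous gather-apply-scatter PageRank in the spirit of vertex programs (a sketch, not the GraphLab API; the graph, damping factor and iteration count are made up):

# Toy synchronous gather-apply-scatter (GAS) PageRank; not the GraphLab API.

def pagerank(out_edges, n_iter=20, d=0.85):
    vertices = set(out_edges) | {u for dsts in out_edges.values() for u in dsts}
    rank = {v: 1.0 for v in vertices}
    for _ in range(n_iter):
        # Gather: each vertex sums contributions coming from its in-neighbors only.
        gathered = {v: 0.0 for v in vertices}
        for src, dsts in out_edges.items():
            for dst in dsts:
                gathered[dst] += rank[src] / len(dsts)
        # Apply: update every vertex from its gathered value (synchronous / BSP step).
        rank = {v: (1 - d) + d * gathered[v] for v in vertices}
        # Scatter: nothing to do here; a real vertex program would signal changed neighbors.
    return rank

# Tiny example graph given as out-edge adjacency lists.
print(pagerank({"a": ["b", "c"], "b": ["c"], "c": ["a"]}))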
BSP processing
Synchronous vs Asynchronous
[Figure: J. Gonzalez, Parallel Gibbs Sampling, 2010 — sequential execution preserves a strong positive correlation between variables, while synchronous parallel execution produces a strong negative correlation; asynchronous execution is "ML approved".]
Many Graph-Parallel Algorithms
• Collaborative Filtering
  – Alternating Least Squares
  – Stochastic Gradient Descent
  – Tensor Factorization
• Structured Prediction
  – Loopy Belief Propagation
  – Max-Product Linear Programs
  – Gibbs Sampling
• Semi-supervised ML
  – Graph SSL
  – CoEM
• Community Detection
  – Triangle Counting
  – K-core Decomposition
  – K-Truss
• Graph Analytics
  – PageRank
  – Personalized PageRank
  – Shortest Path
  – Graph Coloring
• Classification
  – Neural Networks
Playing with GraphLab
• C++ library using MPI for communication.
• Compiles without problems on Linux. Works on Mac but is a bit more involved (surprisingly, since it seems to be developed mainly on Mac).
• Easy deployment on a cluster. Basic ALS on a small Netflix subset works. No logistic regression implemented (it is a graph-oriented framework after all).
• Nice API for vertex programming. Would like to try collapsed Gibbs sampling on a larger dataset (Wikipedia?).
• Data input is constrained and preprocessing can be cumbersome (Spark could be used to take care of this part).
Model parallel
Parameter programming
Big models
Data and models do not fit into memory anymore.
• Deep learning: neural nets with 10B parameters.
• PGM: LDA with 1MM words × 1K topics.
• Partition the data on several machines.
• Also partition the model.
[J. Gonzalez ICML 2014]
Parameter programming
IMO the most ambitious paradigm for large scale ML:
1. asynchronous (for online learning),
2. flexible consistency models (for Hogwild!-style algorithms).
• With Hadoop/Spark you program on parallel collections.
• With GraphLab/Pregel you program on vertices.
• A ParameterServer lets you program on parameters.
Two implementations, both from Carnegie Mellon University: http://parameterserver.org and http://petuum.github.io
But it is for VERY large scale problems.
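A toy, single-process sketch of what "programming on parameters" means: workers pull the current weights, compute a gradient on their own shard of the data, and push an update back without a global barrier (this is not the parameterserver.org or Petuum API; the logistic loss is an assumption):

# Toy parameter-server sketch: pull weights, compute a local gradient, push an update.
import numpy as np

class ParameterServer:
    def __init__(self, dim):
        self.w = np.zeros(dim)

    def pull(self):
        return self.w.copy()                 # workers read a (possibly stale) copy

    def push(self, grad, lr=0.1):
        self.w -= lr * grad                  # updates are applied as they arrive

def worker_step(server, X, y):
    w = server.pull()                        # pull current parameters
    p = 1.0 / (1.0 + np.exp(-X @ w))         # local logistic-loss gradient (assumed loss)
    server.push(X.T @ (p - y) / len(y))      # push the update; no synchronisation barrier

# Toy run: 4 "workers", each owning a shard of the data.
rng = np.random.default_rng(0)
X, y = rng.normal(size=(4000, 10)), rng.integers(0, 2, size=4000).astype(float)
server = ParameterServer(dim=10)
for _ in range(100):
    for i in range(4):                       # in a real system these run asynchronously
        worker_step(server, X[i::4], y[i::4])

With flexible consistency models (Hogwild!-style), workers are allowed to push updates computed from stale pulls; the sketch above simply ignores that subtlety.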
Implemented algorithms
Very, very beta…
• Linear and logistic regression
• Neural nets?
Playing with parameter server
• Could not make ParameterServer work as of now…
• Petuum compiles easily on Linux.
• Neural network training works on a randomly generated dataset on my laptop.
• Support for cluster deployment too, but I haven't tried it yet. Configuration will probably not be easy…
Summary
Spark, GraphLab and ParameterServer are complementary frameworks.
• Spark is easy to use and has a well-thought-out API that makes implementing new models quite easy (as long as they fit the Map-Reduce paradigm). It seems mainly targeted at companies already familiar with the Hadoop stack.
• GraphLab is designed for use by machine learning researchers. I am not certain vertex programming is convenient for all types of ML algorithms, but it certainly is appealing for MCMC methods.
• ParameterServer is the framework to rule them all. It is targeted at very large scale machine learning and is still at a very early development stage.
Thanks