Scalable Machine Learning
A survey of large scale machine learning frameworks.
Arnaud Rachez [email protected]
Jul 28, 2015
Intro - Cellule de Calcul
• Who?
  – Engineers: Arnaud Rachez, Fabian Pedregosa (from Feb. 2015)
  – Researchers: Stéphane Gaiffas (X), Robin Ryder (Dauphine)
• What for?
  – Pooling computational resources for the partners of the chair.
  – Centralizing computational expertise for academic projects and industrial collaborations.
Context
• Viewing Big Data from the perspective of a machine learning researcher.
• Implementing algorithms at scale in a parallel and distributed fashion.
• Big models trained with online optimisation (e.g. deep networks) or sampling (e.g. topic models).
Why all the hype?
Alon Halevy, Peter Norvig, Fernando Pereira. The Unreasonable Effectiveness of Data. IEEE Intelligent Systems, 2009.
[Max Welling, ICML 2014]
Big Data → Big Models
Outline
• Out of core
• Data parallel
• Graph parallel
• Model parallel
More details
Since this is a short talk, I'll go very quickly over a lot of the interesting details of the frameworks. If you are interested in knowing more and have ~2 hours to spare, you should definitely check out J. Gonzalez's talk:
J. Gonzalez @ ICML 2014: techtalks.tv/talks/emerging-systems-for-large-scale-machine-learning/60852/
Out of core
Scaling on one machine
Out of core
• Problem: Training data does not fit in RAM.
• Solution: Lay out data efficiently on disk and load it as needed in memory.
• Very fast online learning. One thread to read, one to train. Hashing trick, online error, etc.
• Sometimes extends to GPU computing: parallel matrix multiplication, where the bottleneck tends to be CPU-GPU memory transfer.
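To give a rough feel for the out-of-core pattern and the hashing trick, here is a minimal Python sketch using scikit-learn (this is not Vowpal Wabbit itself; the file name, line format, chunk size and feature count are assumptions made for illustration):

# Minimal out-of-core learning sketch (illustrative; not Vowpal Wabbit).
# Assumes a hypothetical file "clicks.tsv" with one "label<TAB>raw text features" per line.
from itertools import islice

from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

vectorizer = HashingVectorizer(n_features=2**20)   # hashing trick: fixed-size memory
model = SGDClassifier(loss="log_loss")             # online logistic regression
                                                   # (older scikit-learn spells this loss="log")

def read_chunks(path, chunk_size=100_000):
    """Yield (labels, texts) chunks so that only one chunk is ever held in RAM."""
    with open(path) as f:
        while True:
            lines = list(islice(f, chunk_size))
            if not lines:
                return
            labels = [int(line.split("\t", 1)[0]) for line in lines]
            texts = [line.split("\t", 1)[1] for line in lines]
            yield labels, texts

for labels, texts in read_chunks("clicks.tsv"):
    X = vectorizer.transform(texts)                 # sparse, hashed feature matrix
    model.partial_fit(X, labels, classes=[0, 1])    # one online pass over the chunk

VW additionally dedicates one thread to reading and one to training so that I/O and learning overlap; the sketch above does both in the same thread.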
Playing with Vowpal Wabbit
• Criteo's Display Advertising Challenge dataset: ~10GB with ~50MM lines.
• VW's logistic regression run on one EC2 instance with an attached EBS volume (3000 reserved IOPS):
  – cross-entropy = 0.473 in 2’10” (one online pass)
  – converged to 0.470 in 7 passes (9’4”)
• Pure C++ code. Compiles without problems on Linux, but the latest version has trouble on Mac. Has recently added support for a cluster mode using allreduce.
• Does not seem to support implementing new algorithms easily.
Scalability - A perspective on Big data
• Strong scaling: if you throw twice as many machines at the task, you solve it in half the time. Usually relevant when the task is CPU bound.
• Weak scaling: if the dataset is twice as big, throw twice as many machines at it to solve the task in constant time. Usually relevant when the task is memory bound.
Most “big data” problems are I/O bound, which makes it hard to solve the task in an acceptable time independently of the size of the data (weak scaling).
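As a toy illustration of the two regimes (all timings below are made up):

# Toy definitions of strong and weak scaling (illustrative only).

def strong_scaling_speedup(t_1_machine, t_p_machines):
    """Strong scaling: same dataset on p machines; the ideal speedup is p."""
    return t_1_machine / t_p_machines

def weak_scaling_efficiency(t_1_machine, t_p_machines_p_x_data):
    """Weak scaling: dataset grows with p; the ideal efficiency is 1.0 (constant time)."""
    return t_1_machine / t_p_machines_p_x_data

# Hypothetical timings: 2 machines on the same data, then 2 machines on 2x the data.
print(strong_scaling_speedup(100.0, 55.0))      # ~1.8x, below the ideal 2x
print(weak_scaling_efficiency(100.0, 110.0))    # ~0.91, below the ideal 1.0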
Data parallel
Statistical query model
Map-Reduce: Statistical query model
In the statistical query model the algorithm only needs sums of the form Σ f(x, y) over the dataset:
• f, the map function, is sent to every machine.
• The sum corresponds to a reduce operation.
• D. Caragea et al., A Framework for Learning from Distributed Data Using Sufficient Statistics and Its Application to Learning Decision Trees. Int. J. Hybrid Intell. Syst. 2004
• Chu et al., Map-Reduce for Machine Learning on Multicore. NIPS’06.
Statistical query model - Example
Gradient of the loss:
• For each (x, y) in the dataset, compute its gradient in parallel (Map step).
• Sum the gradients via Reduce.
• Update w on the master.
• Repeat until convergence.
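A single-machine sketch of this pattern follows, assuming a logistic loss (the slide does not specify the loss); in a real cluster each partition's gradient would be computed on its own machine:

# Statistical-query-model gradient descent, sketched on one machine.
# The logistic loss is an assumption; any loss whose gradient is a sum over examples fits.
from functools import reduce
import numpy as np

def partition_gradient(w, X, y):
    """Map step: summed gradient of the logistic loss over one data partition."""
    p = 1.0 / (1.0 + np.exp(-X @ w))    # predicted probabilities
    return X.T @ (p - y)

def train(partitions, dim, lr=0.1, n_iter=100):
    n_total = sum(len(y) for _, y in partitions)
    w = np.zeros(dim)
    for _ in range(n_iter):
        # Map: in a cluster, each partition's gradient is computed where the data lives.
        grads = [partition_gradient(w, X, y) for X, y in partitions]
        # Reduce: sum the partial gradients, then update w on the master.
        w -= lr * reduce(np.add, grads) / n_total
    return w

# Toy data split into 4 "partitions".
rng = np.random.default_rng(0)
X, y = rng.normal(size=(4000, 10)), rng.integers(0, 2, size=4000).astype(float)
w = train([(X[i::4], y[i::4]) for i in range(4)], dim=10)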
Map-Reduce
• Resilient to failure: HDFS disk replication.
• Can run on huge clusters.
• Makes use of data locality: the program (query) is moved to the data and not the opposite.
• Map functions must be stateless: state is lost between map iterations.
• The computation graph is very constrained by independencies: not ideal for computation on arbitrary graphs.
From Hadoop to Spark
Shamelessly stolen from J. Gonzalez's presentation.
Implemented algorithms (MLlib 1.1)
• linear SVM and logistic regression
• classification and regression tree
• k-means clustering
• recommendation via alternating least squares
• singular value decomposition
• linear regression with L1- and L2-regularization
• multinomial naive Bayes
• basic statistics
• feature transformations
Playing with Spark
• Scala library with Java and Python interfaces. The Python version was not always responsive.
• Easy installation on both Linux and Mac. EC2 scripts allow for easy deployment in standalone cluster mode (give instances some additional time to be initialised correctly).
• The code base is under active development and MLlib seems a bit buggy at times. The Spark 1.1 version fixes an OutOfMemory error but crashes at the very end of the job.
• Strong scaling for logistic regression was super-linear… (probably due to a sub-optimal configuration).
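For reference, the kind of MLlib call involved, sketched against the Spark 1.x-era Python API (the input path and line format are placeholders):

# Spark 1.x-style MLlib logistic regression sketch; path and data format are placeholders.
from pyspark import SparkContext
from pyspark.mllib.classification import LogisticRegressionWithSGD
from pyspark.mllib.regression import LabeledPoint

sc = SparkContext(appName="lr-demo")

def parse(line):
    # Assumed format: "label feature1 feature2 ..." separated by spaces.
    values = [float(v) for v in line.split()]
    return LabeledPoint(values[0], values[1:])

data = sc.textFile("hdfs:///path/to/train.txt").map(parse).cache()
model = LogisticRegressionWithSGD.train(data, iterations=100)

errors = data.map(lambda p: int(model.predict(p.features) != p.label)).sum()
print("training error: %f" % (errors / float(data.count())))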
Graph parallel
Vertex programming
Pregel
The Graph-parallel pattern
• Model / algorithm state is placed on the graph.
• Computation depends only on the neighbors.
Shamelessly stolen from J. Gonzalez's presentation, ICML '14.
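To make "computation depends only on the neighbors" concrete, here is a toy synchronous gather-apply-scatter PageRank in the spirit of vertex programs (a sketch, not the GraphLab API; the graph, damping factor and iteration count are made up):

# Toy synchronous gather-apply-scatter (GAS) PageRank; not the GraphLab API.

def pagerank(out_edges, n_iter=20, d=0.85):
    vertices = set(out_edges) | {u for dsts in out_edges.values() for u in dsts}
    rank = {v: 1.0 for v in vertices}
    for _ in range(n_iter):
        # Gather: each vertex sums contributions coming from its in-neighbors only.
        gathered = {v: 0.0 for v in vertices}
        for src, dsts in out_edges.items():
            for dst in dsts:
                gathered[dst] += rank[src] / len(dsts)
        # Apply: update every vertex from its gathered value (synchronous / BSP step).
        rank = {v: (1 - d) + d * gathered[v] for v in vertices}
        # Scatter: nothing to do here; a real vertex program would signal changed neighbors.
    return rank

# Tiny example graph given as out-edge adjacency lists.
print(pagerank({"a": ["b", "c"], "b": ["c"], "c": ["a"]}))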
BSP processing
Synchronous vs Asynchronous
[Figure: J. Gonzalez, Parallel Gibbs Sampling, 2010 — sequential execution preserves a strong positive correlation between variables, while synchronous parallel execution produces a strong negative correlation; asynchronous execution is "ML approved".]
Many Graph-Parallel Algorithms
• Collaborative Filtering
  – Alternating Least Squares
  – Stochastic Gradient Descent
  – Tensor Factorization
• Structured Prediction
  – Loopy Belief Propagation
  – Max-Product Linear Programs
  – Gibbs Sampling
• Semi-supervised ML
  – Graph SSL
  – CoEM
• Community Detection
  – Triangle Counting
  – K-core Decomposition
  – K-Truss
• Graph Analytics
  – PageRank
  – Personalized PageRank
  – Shortest Path
  – Graph Coloring
• Classification
  – Neural Networks
Playing with GraphLab
• C++ library using MPI for communication.
• Compiles without problems on Linux. Works on Mac but is a bit more involved (surprisingly, since it seems to be developed mainly on Mac).
• Easy deployment on a cluster. Basic ALS on a small Netflix subset works. No logistic regression implemented (it is a graph-oriented framework after all).
• Nice API for vertex programming. Would like to try collapsed Gibbs sampling on a larger dataset (Wikipedia?).
• Data input is constrained and preprocessing can be cumbersome (Spark could be used to take care of this part).
Model parallel
Parameter programming
Big models
Data and models do not fit into memory anymore.
• Deep learning: neural nets with 10B parameters.
• PGM: LDA with 1MM words × 1K topics.
• Partition the data on several machines.
• Also partition the model.
[J. Gonzalez ICML 2014]
Parameter programming
IMO the most ambitious paradigm for large scale ML:
1. asynchronous (for online learning),
2. flexible consistency models (for Hogwild!-style algorithms).
• With Hadoop/Spark you program on parallel collections.
• With GraphLab/Pregel you program on vertices.
• A ParameterServer lets you program on parameters.
Two implementations, both from Carnegie Mellon University: http://parameterserver.org and http://petuum.github.io
But it is for VERY large scale problems.
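A toy, single-process sketch of what "programming on parameters" means: workers pull the current weights, compute a gradient on their own shard of the data, and push an update back without a global barrier (this is not the parameterserver.org or Petuum API; the logistic loss is an assumption):

# Toy parameter-server sketch: pull weights, compute a local gradient, push an update.
import numpy as np

class ParameterServer:
    def __init__(self, dim):
        self.w = np.zeros(dim)

    def pull(self):
        return self.w.copy()                 # workers read a (possibly stale) copy

    def push(self, grad, lr=0.1):
        self.w -= lr * grad                  # updates are applied as they arrive

def worker_step(server, X, y):
    w = server.pull()                        # pull current parameters
    p = 1.0 / (1.0 + np.exp(-X @ w))         # local logistic-loss gradient (assumed loss)
    server.push(X.T @ (p - y) / len(y))      # push the update; no synchronisation barrier

# Toy run: 4 "workers", each owning a shard of the data.
rng = np.random.default_rng(0)
X, y = rng.normal(size=(4000, 10)), rng.integers(0, 2, size=4000).astype(float)
server = ParameterServer(dim=10)
for _ in range(100):
    for i in range(4):                       # in a real system these run asynchronously
        worker_step(server, X[i::4], y[i::4])

With flexible consistency models (Hogwild!-style), workers are allowed to push updates computed from stale pulls; the sketch above simply ignores that subtlety.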
Implemented algorithms
Very, very beta…
• Linear and logistic regression
• Neural nets?
Playing with parameter server
• Could not make ParameterServer work as of now…
• Petuum compiles easily on Linux.
• Neural network training works on a randomly generated dataset on my laptop.
• Support for cluster deployment too, but I haven't tried it yet. Configuration will probably not be easy…
Summary
Spark, GraphLab and ParameterServer are complementary frameworks.
• Spark is easy to use and has a well-thought-out API that makes implementing new models quite easy (as long as they fit the Map-Reduce paradigm). It seems mainly targeted at companies already familiar with the Hadoop stack.
• GraphLab is designed for use by machine learning researchers. I am not certain vertex programming is convenient for all types of ML algorithms, but it certainly is appealing for MCMC methods.
• ParameterServer is the framework to rule them all. It is targeted at very large scale machine learning and is still at a very early development stage.
Thanks