Page 1: Scalable machine learning

Scalable Machine Learning
A survey of large scale machine learning frameworks.

Arnaud Rachez [email protected]

Page 2: Scalable machine learning

Intro - Cellule de Calcul

• Who?
  – Engineers: Arnaud Rachez, Fabian Pedregosa (from Feb. 2015)
  – Researchers: Stéphane Gaiffas (X), Robin Ryder (Dauphine)

• What for?
  – Pooling computational resources for the partners of the chair.
  – Centralizing computational expertise for academic projects and industrial collaborations.

Page 3: Scalable machine learning

Context

• Try to view Big Data from the perspective of a machine learning researcher.

• Implementing algorithms at scale in a parallel and distributed fashion.

• Big models trained with online optimisation (e.g. deep networks) or sampling (e.g. topic models).

Page 5: Scalable machine learning

Outline


• Out of core

• Data parallel

• Graph parallel

• Model parallel

Page 6: Scalable machine learning

More details


J. Gonzalez @ ICML 2014 techtalks.tv/talks/emerging-systems-for-large-scale-machine-learning/60852/

Since this is a short talk, I'll go very quickly over a lot of the interesting details of the frameworks. If you are interested in knowing more and have ~2 hours to spare, you should definitely check out J. Gonzalez's talk at the link above.

Page 7: Scalable machine learning


Out of core

Scaling on one machine

Page 8: Scalable machine learning

Out of core

• Problem: Training data does not fit in RAM.

• Solution: Lay out data efficiently on disk and load it as needed in memory.

Very fast online learning: one thread to read, one to train. Hashing trick, online error, etc.

Sometimes extends to GPU computing: parallel matrix multiplication, where the bottleneck tends to be CPU-GPU memory transfer.
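To make the online-learning / hashing-trick note concrete, here is a minimal sketch of out-of-core online logistic regression in plain Python (this is not VW's actual code; the file name and its "label tok1 tok2 ..." format are made up for the example). The training file is streamed line by line, so only one example is ever held in memory, and features are hashed into a fixed-size weight vector:

```python
import math

D = 2 ** 20                      # fixed size of the hashed weight vector
w = [0.0] * D                    # the model fits easily in RAM even if the data does not
lr = 0.1                         # learning rate

def hashed_features(tokens):
    # Hashing trick: map raw string features to indices in [0, D).
    return [hash(tok) % D for tok in tokens]

def predict(idx):
    # Sigmoid of the sparse dot product <w, x> for binary features.
    z = sum(w[i] for i in idx)
    return 1.0 / (1.0 + math.exp(-z))

# Stream the training file: only one line is in memory at a time (out of core).
with open("train.txt") as f:     # hypothetical file: "label tok1 tok2 ..."
    for line in f:
        label, *tokens = line.split()
        y = 1.0 if label == "1" else 0.0
        idx = hashed_features(tokens)
        p = predict(idx)
        for i in idx:            # online SGD step on the logistic loss
            w[i] -= lr * (p - y)
```

The online (progressive) error can be accumulated in the same loop, which is why a single pass over the data already gives a useful validation signal.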

Page 9: Scalable machine learning

Playing with Vowpal Wabbit

• Criteo's Display Advertising Challenge dataset: ~10 GB with ~50 million lines

• VW's logistic regression run on one EC2 instance with an attached EBS volume (3000 reserved IOPS):
  – cross-entropy = 0.473 in 2'10" (one online pass)
  – converged to 0.470 in 7 passes (9'4")

Pure C++ code. Compiles without problems on Linux, but the latest version has trouble on Mac. Recently added support for a cluster mode using allreduce.

Does not seem to offer easy support for implementing new algorithms.

Page 10: Scalable machine learning

Scalability - A perspective on Big data

• Strong scaling: if you throw twice as many machines at the task, you solve it in half the time. Usually relevant when the task is CPU bound.

• Weak scaling: if the dataset is twice as big, throw twice as many machines at it to solve the task in constant time. Usually relevant when the task is memory bound.

Most “big data” problems are I/O bound, which makes it hard to solve the task in an acceptable time independently of the size of the data (weak scaling).

Page 11: Scalable machine learning


Data parallel

Statistical query model

Page 12: Scalable machine learning

Map-Reduce: Statistical query model


f, the map function, is sent to every machine

the sum corresponds to a reduce operation

• D. Caragea et al., A Framework for Learning from Distributed Data Using Sufficient Statistics and Its Application to Learning Decision Trees. Int. J. Hybrid Intell. Syst. 2004

• Chu et al., Map-Reduce for Machine Learning on Multicore. NIPS’06.

Page 13: Scalable machine learning

Statistical query model - Example


Gradient of the loss:

• For each (x, y) in the dataset, compute the per-example gradient in parallel (Map step)
• Sum the gradients via Reduce
• Update w on the master
• Repeat until convergence
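For logistic regression, for instance, the full gradient decomposes as a sum of per-example terms, ∇L(w) = Σ(x,y) (σ(wᵀx) − y)·x, which is what makes this Map-Reduce formulation possible. Below is a minimal sketch of the loop in plain Python, with the built-in map/reduce standing in for the cluster-wide operations and a randomly generated toy dataset standing in for the distributed data:

```python
import numpy as np
from functools import reduce

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def grad_one(w, x, y):
    # Map step: gradient of the logistic loss on a single example.
    return (sigmoid(w @ x) - y) * x

# Toy data standing in for the distributed dataset (values made up for the example).
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = (X[:, 0] > 0).astype(float)

w = np.zeros(5)
lr = 0.5
for _ in range(100):                                      # repeat until convergence
    grads = map(lambda xy: grad_one(w, *xy), zip(X, y))   # Map step: one gradient per example
    total = reduce(np.add, grads)                         # Reduce step: sum the gradients
    w = w - lr * total / len(X)                           # update w on the master
```

In a real deployment the map step runs on the machines that hold the data, and only the summed gradient travels back to the master.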

Page 14: Scalable machine learning

Map-Reduce


• Resilient to failure. HDFS disk replication.

• Can run on huge clusters.

• Makes use of data locality. The program (query) is moved to the data, not the other way around.

• Map functions must be stateless. State is lost between map iterations.

• The computation graph is very constrained by independence assumptions. Not ideal for computation on arbitrary graphs.

Page 15: Scalable machine learning

From Hadoop to Spark

Shamelessly stolen from J. Gonzalez's presentation

Page 16: Scalable machine learning

Implemented algorithms (MLlib 1.1)

• linear SVM and logistic regression
• classification and regression tree
• k-means clustering
• recommendation via alternating least squares
• singular value decomposition
• linear regression with L1- and L2-regularization
• multinomial naive Bayes
• basic statistics
• feature transformations
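As an illustration of calling one of these, here is a minimal PySpark sketch of logistic regression, assuming MLlib's 1.x Python API; the input path and the "label f1 f2 ..." text format are hypothetical:

```python
from pyspark import SparkContext
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.classification import LogisticRegressionWithSGD

sc = SparkContext(appName="mllib-logreg-example")

def parse(line):
    # Hypothetical text format: "label f1 f2 ... fn", space separated.
    values = [float(v) for v in line.split()]
    return LabeledPoint(values[0], values[1:])

# The RDD is partitioned across the cluster; transformations run where the data lives.
data = sc.textFile("hdfs:///path/to/train.txt").map(parse).cache()

model = LogisticRegressionWithSGD.train(data, iterations=100, step=1.0)

# Training error, computed with another map over the distributed data.
err = data.map(lambda p: int(model.predict(p.features) != p.label)).mean()
print("training error: %g" % err)
```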

Page 17: Scalable machine learning

Playing with Spark

• Scala library with Java and Python interfaces. The Python version was not always responsive.

• Easy installation on both Linux and Mac. EC2 scripts allow for easy deployment in standalone cluster mode (give instances some additional time to be initialised correctly).

• The code base is under active development and MLlib seems a bit buggy at times. Spark 1.1 fixes an OutOfMemory error but crashes at the very end of the job.

• Strong scalability for logistic regression was super-linear… (probably due to a sub-optimal configuration)

Page 18: Scalable machine learning


Graph parallel

Vertex programming

Pregel

Page 19: Scalable machine learning

The Graph-parallel pattern


[Figure: the model / algorithm state is spread over the vertices of a graph, and computation at each vertex depends only on its neighbors.]

Shamelessly stolen from J. Gonzalez's presentation at ICML '14
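To make the pattern concrete, here is a minimal framework-free Python sketch of a synchronous (BSP-style) vertex program for PageRank; the toy graph and the damping factor are made up for the example, and frameworks like GraphLab or Pregel distribute a loop of this shape over a partitioned graph:

```python
# Toy directed graph: vertex -> list of out-neighbors (hypothetical data).
out_edges = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
vertices = list(out_edges)

rank = {v: 1.0 / len(vertices) for v in vertices}
damping = 0.85

def vertex_program(v, incoming):
    # Update one vertex from the messages sent by its in-neighbors.
    return (1 - damping) / len(vertices) + damping * sum(incoming)

for superstep in range(20):                  # BSP: all vertices update in lockstep
    # Scatter: every vertex sends rank / out-degree along its out-edges.
    messages = {v: [] for v in vertices}
    for u, neighbors in out_edges.items():
        for v in neighbors:
            messages[v].append(rank[u] / len(neighbors))
    # Gather/apply: new ranks are computed from the previous superstep only (synchronous).
    rank = {v: vertex_program(v, messages[v]) for v in vertices}

print(rank)
```

Each vertex update reads only its neighbors' values from the previous superstep, which is exactly the constraint that makes the computation easy to partition.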

Page 20: Scalable machine learning

BSP processing

Synchronous vs Asynchronous


[Figure from J. Gonzalez, Parallel Gibbs Sampling, 2010: starting from variables with a strong positive correlation at t=0, the synchronous parallel execution over steps t=1..3 ends up with a strong negative correlation, while the sequential execution keeps the strong positive correlation; the asynchronous execution is labelled "ML APPROVED".]

Page 21: Scalable machine learning

Many Graph-Parallel Algorithms

• Collaborative Filtering
  – Alternating Least Squares
  – Stochastic Gradient Descent
  – Tensor Factorization

• Structured Prediction
  – Loopy Belief Propagation
  – Max-Product Linear Programs
  – Gibbs Sampling

• Semi-supervised ML
  – Graph SSL
  – CoEM

• Community Detection
  – Triangle-Counting
  – K-core Decomposition
  – K-Truss

• Graph Analytics
  – PageRank
  – Personalized PageRank
  – Shortest Path
  – Graph Coloring

• Classification
  – Neural Networks

Page 22: Scalable machine learning

Playing with GraphLab

• C++ library using MPI for communication.

• Compiles without problems on Linux. Works on Mac but is a bit more involved (surprisingly, since it seems to be developed mainly on Mac).

• Easy deployment on a cluster. Basic ALS on a small Netflix subset works. No logistic regression implemented (it is a graph-oriented framework after all).

• Nice API for vertex programming. Would like to try collapsed Gibbs sampling on a larger dataset (Wikipedia?).

• Data input is constrained and preprocessing can be cumbersome (Spark could be used to take care of this part).

Page 23: Scalable machine learning


Model parallel

Parameter programming

Page 24: Scalable machine learning

Big models

Data and models do not fit into memory anymore.

• Deep learning: neural nets with 10B parameters
• PGM: LDA with 1M words × 1K topics

• Partition the data over several machines
• Also partition the model

[J. Gonzalez, ICML 2014]

Page 25: Scalable machine learning

Parameter programming


IMO the most ambitious paradigm for large scale ML: 1. asynchronous (for online learning), 2. flexible consistency models (for Hogwild! algorithms)

• With Hadoop/Spark you program on parallel collections.
• With GraphLab/Pregel you program on vertices.

ParameterServer lets you program on parameters.

Two implementations, both from Carnegie Mellon University: http://parameterserver.org and http://petuum.github.io

But it is for VERY large scale problems.
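A minimal single-process sketch of the idea in Python, with threads standing in for workers and a shared array standing in for the server (this is not the parameterserver.org or Petuum API, and the data is randomly generated): each worker pulls the current weights, computes a gradient on its own shard, and pushes the update back asynchronously, Hogwild!-style, without any locking.

```python
import threading
import numpy as np

# "Server": a single shared weight vector, updated without locks (Hogwild!-style).
dim, lr = 10, 0.01
weights = np.zeros(dim)

rng = np.random.default_rng(0)
shards = [(rng.normal(size=(500, dim)), rng.integers(0, 2, size=500))
          for _ in range(4)]                 # one hypothetical data shard per worker

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def worker(X, y):
    for i in range(len(X)):
        w = weights.copy()                   # pull the current parameters
        g = (sigmoid(w @ X[i]) - y[i]) * X[i]
        weights[:] -= lr * g                 # push the update back (no lock)

# (Python's GIL makes this illustrative of the access pattern only, not a speedup.)
threads = [threading.Thread(target=worker, args=shard) for shard in shards]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(weights)
```

In a real parameter server the weight vector itself is sharded over server nodes and the push/pull calls go over the network, with the consistency model (fully asynchronous vs. bounded staleness) exposed as a knob.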

Page 26: Scalable machine learning

Implemented algorithms

Very, very beta…

• Linear and logistic regression
• Neural nets?

Page 27: Scalable machine learning

Playing with parameter server


• Could not make ParameterServer work as of now…

• Petuum compiles easily on Linux.
• Neural network training works on a randomly generated dataset on my laptop.
• Support for cluster deployment too, but I haven't tried it yet. Configuration will probably not be easy…

Page 28: Scalable machine learning

Summary


Spark, GraphLab and ParameterServer are complementary frameworks.

• Spark is easy to use and has a well-thought-out API that makes implementing new models quite easy (as long as they fit the Map-Reduce paradigm). It seems mainly targeted at companies already familiar with the Hadoop stack.

• GraphLab is designed for use by machine learning researchers. I am not certain vertex programming is convenient for all types of ML algorithms, but it certainly is appealing for MCMC methods.

• ParameterServer is the framework to rule them all. It is targeted at very large scale machine learning and is still at a very early development stage.

Page 29: Scalable machine learning


Thanks