Performance and scalability for machine learning
Arnaud Rachez ([email protected])
November 2nd, 2015
Outline
Performance (7mn)
Parallelism (7mn)
Scalability (10mn)
Numbers everyone should know (2015 update)
[Chart: throughput (GB/s) of L1, L2, L3 caches and RAM, scale 0-800 GB/s]
[Chart: throughput of RAM (~30 GB/s), SSD (~1.25 GB/s) and network (~800 MB/s), scale 0-30 GB/s]
Sources: http://www.eecs.berkeley.edu/~rcs/research/interactive_latency.html, http://forums.aida64.com/topic/2864-i7-5775c-l4-cache-performance/, http://www.macrumors.com/2015/05/21/15-inch-retina-macbook-pro-2gbps-throughput/
Outline
Performance (5-7mn)
Parallelism (5-7mn)
Scalability (7-10mn)
Optimising SGD
Linear-regression-like stochastic gradient descent with d=5 features and n=1,000,000 examples.
Implemented in pure Python (1), Numba (2), Numpy (3) and Cython (4): https://gist.github.com/zermelozf/3cd06c8b0ce28f4eeacd
Also compared to pure C++ code: https://gist.github.com/zermelozf/4df67d14f72f04b4338a
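As a rough illustration of what the gists benchmark, here is a minimal pure-Python SGD loop for least-squares linear regression (a sketch, not the gists' actual code; the learning rate and toy data are made up):

```python
import random

def sgd_python(X, y, lr=0.1, n_epochs=5):
    """Plain-Python SGD for least squares: w <- w - lr * (w.x_i - y_i) * x_i."""
    d = len(X[0])
    w = [0.0] * d
    for _ in range(n_epochs):
        for xi, yi in zip(X, y):
            err = sum(w[j] * xi[j] for j in range(d)) - yi
            for j in range(d):
                w[j] -= lr * err * xi[j]
    return w

# toy usage: recover y = 2*x0 + 1*x1 from noise-free data
random.seed(0)
X = [[random.random(), random.random()] for _ in range(5000)]
y = [2.0 * a + 1.0 * b for a, b in X]
w = sgd_python(X, y)
```

Every operation above goes through the interpreter, which is why this variant sits at the slow end of the chart on the next slide.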
Runtime optimisation
Optimisation strategies (d=5 & n=1,000,000)
[Bar chart, time in ms on a log scale from 1 to 10,000, for Python, Numpy, Cython, Numba and C++: pure Python is slowest by orders of magnitude. Chart annotations: memory views? pointers?]
Source: https://gist.github.com/zermelozf/3cd06c8b0ce28f4eeacd
Runtime optimisation
Cache optimisation (d=5 & n=1,000,000)
[Bar chart, time in ms from 0 to 160, for Numba, C++ and Cython: iterating over examples in random order (cache misses) vs. linear order (cache hits). Linear access is markedly faster.]
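The gap comes from memory-access order: scanning examples sequentially keeps upcoming rows in cache, while random indexing keeps evicting them. A small sketch of the two access patterns (illustrative only; the timing gap is far larger in the compiled Numba/Cython/C++ versions than in interpreted Python):

```python
import numpy as np

n, d = 1_000_000, 5
X = np.random.rand(n, d)

linear_order = np.arange(n)                # sequential rows: cache hits
shuffled_order = np.random.permutation(n)  # random rows: cache misses

def epoch_sum(order):
    """Touch one value per example, in the given order."""
    total = 0.0
    for i in order:
        total += X[i, 0]
    return total

# Same work, same result -- only the memory-access pattern differs.
s_lin = epoch_sum(linear_order)
s_rnd = epoch_sum(shuffled_order)
```

Timing `epoch_sum` under both orders (e.g. with `time.perf_counter`) exposes the cache effect the chart above measures.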
(d>>1) Gensim word2vec case study
Elman-style RNN trained with SGD: a 15,079 x 200 matrix on a 1M-word corpus.
Baseline written by Tomas Mikolov in optimised C.
Rewritten by Radim Řehůřek in Python.
Optimised by Radim Řehůřek using Cython and BLAS.
[Bar chart, words/sec (x1000), 0 to 120: original C vs. Numpy vs. Cython vs. Cython + BLAS vs. Cython + BLAS + sigmoid table.]
Source: http://rare-technologies.com/word2vec-in-python-part-two-optimizing/
What's this BLAS magic?
Source: https://github.com/piskvorky/gensim/blob/develop/gensim/models/word2vec_inner.pyx
A vectorised y = alpha*x replaced 3 lines of code and translated into a 3x speedup over Cython alone. Please read http://rare-technologies.com/word2vec-in-python-part-two-optimizing/
**On my MacBook Pro, SciPy automatically links against Apple's vecLib, which contains an excellent BLAS. Similarly, Intel's MKL, AMD's ACML, Sun's SunPerf or the automatically tuned ATLAS are all good choices.
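The "magic" is a single-precision AXPY call, y <- alpha*x + y, from SciPy's BLAS bindings. A hedged Python-level sketch (gensim actually calls the BLAS function pointer directly from Cython; `alpha`, `x` and `y` here are toy values):

```python
import numpy as np
from scipy.linalg.blas import saxpy  # single-precision BLAS: z = a*x + y

alpha = np.float32(0.5)
x = np.arange(4, dtype=np.float32)   # toy vectors
y = np.ones(4, dtype=np.float32)

expected = alpha * x + y             # the "3 lines of code" boil down to this
z = saxpy(x, y, a=alpha)             # ...done in one tuned BLAS call
```

One BLAS call replaces the explicit loop, and the tuned library (vecLib, MKL, ACML, ATLAS, ...) does the arithmetic with SIMD instructions.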
Outline
Performance (5-7mn)
Parallelism (5-7mn)
Scalability (7-10mn)
Hardware trends: CPU
[Chart, 1970-2015: clock speed (MHz, 0-4000) plateaus in the mid-2000s while the number of cores (0-4) starts climbing.]
Source: http://www.gotw.ca/publications/concurrency-ddj.htm
(d>>1) Gensim word2vec continued
Elman-style RNN trained with SGD: a 15,079 x 200 matrix on a 1M-word corpus.
Baseline written by Tomas Mikolov in optimised C.
Rewritten by Radim Řehůřek in Python.
Optimised by Radim Řehůřek using Cython and BLAS... and parallelised with threads!
[Bar chart, words/sec (x1000), 0 to 400, for 1 to 4 threads, against the original C and Cython + BLAS + sigmoid table baselines: up to a 2.85x speedup.]
Source: http://rare-technologies.com/parallelizing-word2vec-in-python/
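Plain Python threads normally fight over the GIL, but gensim's Cython training loop releases it, so its threads truly run in parallel. The same effect can be sketched from pure Python whenever the heavy work happens inside NumPy/BLAS calls, which also release the GIL (illustrative example; the chunk sizes and `work` function are made up):

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

# NumPy/BLAS release the GIL inside large matrix operations, so plain
# Python threads can execute them in parallel -- the same property a
# Cython "nogil" training loop exploits.
rng = np.random.default_rng(0)
chunks = rng.standard_normal((8, 300, 300))  # 8 chunks of fake work

def work(chunk):
    return float(np.linalg.norm(chunk @ chunk.T))

with ThreadPoolExecutor(max_workers=4) as pool:
    parallel = list(pool.map(work, chunks))

serial = [work(c) for c in chunks]           # same answers, one thread
```

If `work` were a pure-Python loop instead, the threads would serialise on the GIL and the speedup would vanish.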
(d>>1) Hogwild! on SAG
Fabian's experimentation with Julia (lang).
Running SAG in parallel, without a lock.
Very nice speed up!
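A minimal lock-free sketch of the Hogwild! idea in Python/NumPy (not Fabian's Julia code, and plain SGD rather than SAG): several threads hammer the same weight vector with updates and no lock, accepting the occasional racy write.

```python
import threading
import numpy as np

# Hogwild!-style: all threads update the SAME weight vector, no lock.
# Updates can race, but the added noise is tolerable in practice.
rng = np.random.default_rng(0)
n, d = 4000, 5
X = rng.random((n, d))
true_w = np.arange(1.0, d + 1.0)      # target weights [1, 2, 3, 4, 5]
y = X @ true_w                        # noise-free labels

w = np.zeros(d)                       # shared state, no lock anywhere

def worker(rows, lr=0.05, epochs=4):
    for _ in range(epochs):
        for i in rows:
            err = X[i] @ w - y[i]
            w[:] -= lr * err * X[i]   # racy read-modify-write, on purpose

threads = [threading.Thread(target=worker, args=(range(k, n, 4),))
           for k in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Despite the races, the shared weights still converge to the target; the original Hogwild! paper makes this precise for sparse problems.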
Data does not fit in memory
Stream data from disk... but you cannot read in parallel.
Producer/Consumer pattern: thread 1 (the producer) reads chunks 1-6 from disk and enqueues them as jobs; the consumer threads pick up jobs, process them and mark them done. Et cetera.
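The pattern above can be sketched with the standard library's thread-safe `queue.Queue` (a toy illustration: chunks are simulated strings rather than disk reads):

```python
import queue
import threading

jobs = queue.Queue(maxsize=2)   # bounded: producer blocks if consumers lag
results = []
results_lock = threading.Lock()
SENTINEL = None
N_CONSUMERS = 2

def producer(n_chunks):
    for i in range(n_chunks):
        jobs.put(f"chunk {i + 1}")        # stand-in for a disk read
    for _ in range(N_CONSUMERS):
        jobs.put(SENTINEL)                # one stop signal per consumer

def consumer():
    while True:
        job = jobs.get()
        if job is SENTINEL:
            break
        with results_lock:
            results.append(job.upper())   # stand-in for real processing

threads = [threading.Thread(target=producer, args=(6,))]
threads += [threading.Thread(target=consumer) for _ in range(N_CONSUMERS)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

The bounded queue gives backpressure: the single reader never races ahead of the consumers, and disk access stays sequential.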
How many consumers? It depends.
Gensim (R. Řehůřek): saw the impact up to 4 consumers earlier.
Vowpal Wabbit (J. Langford): claims no gain with more than 1 consumer! 210 on my MacBook Pro for ~10GB and 50MM lines (Criteo's advertising dataset).
CNN pre-processing (S. Dieleman): big impact with ?? (several) consumers! Useful for data augmentation/preprocessing.
[Bar chart: Java word-count benchmark on 5.3GB (~105MM lines), y axis 0-220, for 1 to 6 consumers. MacBook Pro 15" 2014, `sudo purge`.]
Source: https://gist.github.com/nicomak/1d6561e6f71d936d3178
Outline
Performance (5-7mn)
Parallelism (5-7mn)
Scalability (7-10mn)
Hardware trends: HDD
[Chart, 1979-2011: disk capacity (GB, 0-600) grows much faster than read throughput, so the time to read a full disk (sec, 0-4000) keeps climbing.]
Source: https://tylermuth.wordpress.com/2011/11/02
Distributed computing
Scalability - a perspective on Big Data
Strong scaling: if you throw twice as many machines at the task, you solve it in half the time. Usually relevant when the task is CPU bound.
Weak scaling: if the dataset is twice as big, throw twice as many machines at it to solve the task in constant time. Usually relevant when the task is memory bound.
Most big data problems are I/O bound: it is hard to solve the task in an acceptable time independently of the size of the data (weak scaling).
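The two definitions reduce to simple arithmetic under ideal conditions (a toy illustration; real systems fall short of both ideals because of communication and I/O overhead):

```python
def strong_scaling_time(t_base, n_machines):
    """Ideal strong scaling: fixed problem, time divides by machine count."""
    return t_base / n_machines

def weak_scaling_time(t_base, data_factor, machine_factor):
    """Ideal weak scaling: data and machines grow together, time stays flat."""
    return t_base * data_factor / machine_factor

halved = strong_scaling_time(1200.0, 2)   # twice the machines, half the time
flat = weak_scaling_time(1200.0, 2, 2)    # twice the data AND machines
```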
Bring computation to data
Map-Reduce: the statistical query model
f, the map function, is sent to every machine; the sum corresponds to a reduce operation.
D. Caragea et al., A Framework for Learning from Distributed Data Using Sufficient Statistics and Its Application to Learning Decision Trees. Int. J. Hybrid Intell. Syst., 2004.
Chu et al., Map-Reduce for Machine Learning on Multicore. NIPS'06.
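Because a least-squares gradient is a plain sum over examples, it fits the statistical query model directly: map f over each machine's shard, then reduce with addition. A toy single-process sketch (the shard contents and weights are made up):

```python
from functools import reduce

w = [0.5, -1.0]                       # current (made-up) weights

def f(example):
    """Map: per-example gradient of squared error under weights w."""
    x, target = example
    err = sum(wj * xj for wj, xj in zip(w, x)) - target
    return [err * xj for xj in x]

def add(g1, g2):
    """Reduce: elementwise sum of partial gradients."""
    return [a + b for a, b in zip(g1, g2)]

shards = [                            # data split across two "machines"
    [([1.0, 0.0], 1.0), ([0.0, 1.0], 2.0)],
    [([1.0, 1.0], 0.0)],
]
partials = [reduce(add, map(f, shard)) for shard in shards]  # per machine
grad = reduce(add, partials)                                 # final reduce
```

Only the small partial sums cross the network; the examples themselves never leave their machine, which is the whole point of bringing computation to the data.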
Spark on Criteo's data!
Logistic regression trained with minibatch SGD.
10GB of data (50MM lines). Caveat: quite small for a benchmark.
Super-linear strong scalability. Not theoretically possible => small dataset + few instances saturate.
[Chart: time in sec (0-1300) and number of cores (0-40) vs. number of AWS nodes (4-10).]
Manual setup of the cluster was a bit painful.
Software stack for big data
Cluster manager: Local, Standalone, YARN, MESOS
Storage layer: HDFS, Tachyon, Cassandra, HBase, others
Execution layer: Spark (memory-optimised execution engine), Flink (Apache-incubated execution engine), Hadoop MR 2
Libraries: Spark has MLlib, GraphX, Streaming, SQL/Dataframe; Flink has FlinkML, Gelly (graph), Table API, Batch
Software stack: MESOS vs YARN
Standalone mode is fastest, but resources are requested for the entire job. Cluster management frameworks allow concurrent access (multiuser) and hyperparameter tuning (multijob).
Mesos: frameworks receive offers; easy install on AWS, GCE; lots of compatible frameworks (Spark, MPI, Cassandra, HDFS); Mesosphere's DCOS is really, really easy to use.
YARN: frameworks make requests; configuration hell (can be made easier with puppet/ansible recipes); several compatible frameworks (Spark, Flink, HDFS).
Infrastructure stack: AWS = AWeSome
Basic instances with spot pricing: 10 r2.2xlarge instances (350GB mem. & 40 cores) for 0.85$/hour.
Graphical Network Designer.
Infrastructure stack
VPC, subnets (public/private), security rules, bootstrap config for master/slaves, network entry point.
Source: https://aws.amazon.com/architecture/
- Questions -
Thank you
BONUS: What's coming in the next few years?