Performance and scalability for machine learning
Arnaud Rachez ([email protected])
November 2nd, 2015
Outline
Performance (7mn)
Parallelism (7mn)
Scalability (10mn)
Numbers everyone should know (2015 update)
[Chart: throughput (GB/s) of L1, L2, L3 caches and RAM, scale 0-800 GB/s]
[Chart: throughput of RAM (~30 GB/s), SSD (~1.25 GB/s) and network (~800 MB/s), scale 0-30 GB/s]
Sources: http://www.eecs.berkeley.edu/~rcs/research/interactive_latency.html, http://forums.aida64.com/topic/2864-i7-5775c-l4-cache-performance/, http://www.macrumors.com/2015/05/21/15-inch-retina-macbook-pro-2gbps-throughput/
Outline
Performance (5-7mn)
Parallelism (5-7mn)
Scalability (7-10mn)
Optimising SGD
Linear-regression-like stochastic gradient descent with d=5 features and n=1,000,000 examples.
Implemented in pure Python (1), Numba (2), Numpy (3) and Cython (4): https://gist.github.com/zermelozf/3cd06c8b0ce28f4eeacd
Also compared to pure C++ code: https://gist.github.com/zermelozf/4df67d14f72f04b4338a
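As a rough illustration of what the gists benchmark, here is a minimal pure-Python SGD loop for least-squares linear regression (a sketch, not the gists' actual code; the learning rate and toy data are made up):

```python
import random

def sgd_python(X, y, lr=0.1, n_epochs=5):
    """Plain-Python SGD for least squares: w <- w - lr * (w.x_i - y_i) * x_i."""
    d = len(X[0])
    w = [0.0] * d
    for _ in range(n_epochs):
        for xi, yi in zip(X, y):
            err = sum(w[j] * xi[j] for j in range(d)) - yi
            for j in range(d):
                w[j] -= lr * err * xi[j]
    return w

# toy usage: recover y = 2*x0 + 1*x1 from noise-free data
random.seed(0)
X = [[random.random(), random.random()] for _ in range(5000)]
y = [2.0 * a + 1.0 * b for a, b in X]
w = sgd_python(X, y)
```

Every operation above goes through the interpreter, which is why this variant sits at the slow end of the chart on the next slide.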
Runtime optimisation
Optimisation strategies (d=5 & n=1,000,000)
[Bar chart, time in ms on a log scale from 1 to 10,000, for Python, Numpy, Cython, Numba and C++: pure Python is slowest by orders of magnitude. Chart annotations: memory views? pointers?]
Source: https://gist.github.com/zermelozf/3cd06c8b0ce28f4eeacd
Runtime optimisation
Cache optimisation (d=5 & n=1,000,000)
[Bar chart, time in ms from 0 to 160, for Numba, C++ and Cython: iterating over examples in random order (cache misses) vs. linear order (cache hits). Linear access is markedly faster.]
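The gap comes from memory-access order: scanning examples sequentially keeps upcoming rows in cache, while random indexing keeps evicting them. A small sketch of the two access patterns (illustrative only; the timing gap is far larger in the compiled Numba/Cython/C++ versions than in interpreted Python):

```python
import numpy as np

n, d = 1_000_000, 5
X = np.random.rand(n, d)

linear_order = np.arange(n)                # sequential rows: cache hits
shuffled_order = np.random.permutation(n)  # random rows: cache misses

def epoch_sum(order):
    """Touch one value per example, in the given order."""
    total = 0.0
    for i in order:
        total += X[i, 0]
    return total

# Same work, same result -- only the memory-access pattern differs.
s_lin = epoch_sum(linear_order)
s_rnd = epoch_sum(shuffled_order)
```

Timing `epoch_sum` under both orders (e.g. with `time.perf_counter`) exposes the cache effect the chart above measures.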
(d>>1) Gensim word2vec case study
Elman-style RNN trained with SGD: a 15,079 x 200 matrix on a 1M-word corpus.
Baseline written by Tomas Mikolov in optimised C.
Rewritten by Radim Řehůřek in Python.
Optimised by Radim Řehůřek using Cython and BLAS.
[Bar chart, words/sec (x1000), 0 to 120: original C vs. Numpy vs. Cython vs. Cython + BLAS vs. Cython + BLAS + sigmoid table.]
Source: http://rare-technologies.com/word2vec-in-python-part-two-optimizing/
What's this BLAS magic?
Source: https://github.com/piskvorky/gensim/blob/develop/gensim/models/word2vec_inner.pyx
A vectorised y = alpha*x replaced 3 lines of code and translated into a 3x speedup over Cython alone. Please read http://rare-technologies.com/word2vec-in-python-part-two-optimizing/
**On my MacBook Pro, SciPy automatically links against Apple's vecLib, which contains an excellent BLAS. Similarly, Intel's MKL, AMD's ACML, Sun's SunPerf or the automatically tuned ATLAS are all good choices.
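The "magic" is a single-precision AXPY call, y <- alpha*x + y, from SciPy's BLAS bindings. A hedged Python-level sketch (gensim actually calls the BLAS function pointer directly from Cython; `alpha`, `x` and `y` here are toy values):

```python
import numpy as np
from scipy.linalg.blas import saxpy  # single-precision BLAS: z = a*x + y

alpha = np.float32(0.5)
x = np.arange(4, dtype=np.float32)   # toy vectors
y = np.ones(4, dtype=np.float32)

expected = alpha * x + y             # the "3 lines of code" boil down to this
z = saxpy(x, y, a=alpha)             # ...done in one tuned BLAS call
```

One BLAS call replaces the explicit loop, and the tuned library (vecLib, MKL, ACML, ATLAS, ...) does the arithmetic with SIMD instructions.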
Outline
Performance (5-7mn)
Parallelism (5-7mn)
Scalability (7-10mn)
Hardware trends: CPU
[Chart, 1970-2015: clock speed (MHz, 0-4000) plateaus in the mid-2000s while the number of cores (0-4) starts climbing.]
Source: http://www.gotw.ca/publications/concurrency-ddj.htm
(d>>1) Gensim word2vec continued
Elman-style RNN trained with SGD: a 15,079 x 200 matrix on a 1M-word corpus.
Baseline written by Tomas Mikolov in optimised C.
Rewritten by Radim Řehůřek in Python.
Optimised by Radim Řehůřek using Cython and BLAS... and parallelised with threads!
[Bar chart, words/sec (x1000), 0 to 400, for 1 to 4 threads, against the original C and Cython + BLAS + sigmoid table baselines: up to a 2.85x speedup.]
Source: http://rare-technologies.com/parallelizing-word2vec-in-python/
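Plain Python threads normally fight over the GIL, but gensim's Cython training loop releases it, so its threads truly run in parallel. The same effect can be sketched from pure Python whenever the heavy work happens inside NumPy/BLAS calls, which also release the GIL (illustrative example; the chunk sizes and `work` function are made up):

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

# NumPy/BLAS release the GIL inside large matrix operations, so plain
# Python threads can execute them in parallel -- the same property a
# Cython "nogil" training loop exploits.
rng = np.random.default_rng(0)
chunks = rng.standard_normal((8, 300, 300))  # 8 chunks of fake work

def work(chunk):
    return float(np.linalg.norm(chunk @ chunk.T))

with ThreadPoolExecutor(max_workers=4) as pool:
    parallel = list(pool.map(work, chunks))

serial = [work(c) for c in chunks]           # same answers, one thread
```

If `work` were a pure-Python loop instead, the threads would serialise on the GIL and the speedup would vanish.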
(d>>1) Hogwild! on SAG
Fabian's experimentation with Julia (lang).
Running SAG in parallel, without a lock.
Very nice speed up!
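A minimal lock-free sketch of the Hogwild! idea in Python/NumPy (not Fabian's Julia code, and plain SGD rather than SAG): several threads hammer the same weight vector with updates and no lock, accepting the occasional racy write.

```python
import threading
import numpy as np

# Hogwild!-style: all threads update the SAME weight vector, no lock.
# Updates can race, but the added noise is tolerable in practice.
rng = np.random.default_rng(0)
n, d = 4000, 5
X = rng.random((n, d))
true_w = np.arange(1.0, d + 1.0)      # target weights [1, 2, 3, 4, 5]
y = X @ true_w                        # noise-free labels

w = np.zeros(d)                       # shared state, no lock anywhere

def worker(rows, lr=0.05, epochs=4):
    for _ in range(epochs):
        for i in rows:
            err = X[i] @ w - y[i]
            w[:] -= lr * err * X[i]   # racy read-modify-write, on purpose

threads = [threading.Thread(target=worker, args=(range(k, n, 4),))
           for k in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Despite the races, the shared weights still converge to the target; the original Hogwild! paper makes this precise for sparse problems.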
Data does not fit in memory
Stream data from disk... but you cannot read in parallel.
Producer/Consumer pattern: thread 1 (the producer) reads chunks 1-6 from disk and enqueues them as jobs; the consumer threads pick up jobs, process them and mark them done. Et cetera.
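The pattern above can be sketched with the standard library's thread-safe `queue.Queue` (a toy illustration: chunks are simulated strings rather than disk reads):

```python
import queue
import threading

jobs = queue.Queue(maxsize=2)   # bounded: producer blocks if consumers lag
results = []
results_lock = threading.Lock()
SENTINEL = None
N_CONSUMERS = 2

def producer(n_chunks):
    for i in range(n_chunks):
        jobs.put(f"chunk {i + 1}")        # stand-in for a disk read
    for _ in range(N_CONSUMERS):
        jobs.put(SENTINEL)                # one stop signal per consumer

def consumer():
    while True:
        job = jobs.get()
        if job is SENTINEL:
            break
        with results_lock:
            results.append(job.upper())   # stand-in for real processing

threads = [threading.Thread(target=producer, args=(6,))]
threads += [threading.Thread(target=consumer) for _ in range(N_CONSUMERS)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

The bounded queue gives backpressure: the single reader never races ahead of the consumers, and disk access stays sequential.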
How many consumers? It depends.
Gensim (R. Řehůřek): saw the impact up to 4 consumers earlier.
Vowpal Wabbit (J. Langford): claims no gain with more than 1 consumer! 210 on my MacBook Pro for ~10GB and 50MM lines (Criteo's advertising dataset).
CNN pre-processing (S. Dieleman): big impact with ?? (several) consumers! Useful for data augmentation/preprocessing.
[Bar chart: Java word-count benchmark on 5.3GB (~105MM lines), y axis 0-220, for 1 to 6 consumers. MacBook Pro 15" 2014, `sudo purge`.]
Source: https://gist.github.com/nicomak/1d6561e6f71d936d3178
Outline
Performance (5-7mn)
Parallelism (5-7mn)
Scalability (7-10mn)
Hardware trends: HDD
[Chart, 1979-2011: disk capacity (GB, 0-600) grows much faster than read throughput, so the time to read a full disk (sec, 0-4000) keeps climbing.]
Source: https://tylermuth.wordpress.com/2011/11/02
Distributed computing
Scalability - a perspective on Big Data
Strong scaling: if you throw twice as many machines at the task, you solve it in half the time. Usually relevant when the task is CPU bound.
Weak scaling: if the dataset is twice as big, throw twice as many machines at it to solve the task in constant time. Usually relevant when the task is memory bound.
Most big data problems are I/O bound: it is hard to solve the task in an acceptable time independently of the size of the data (weak scaling).
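The two definitions reduce to simple arithmetic under ideal conditions (a toy illustration; real systems fall short of both ideals because of communication and I/O overhead):

```python
def strong_scaling_time(t_base, n_machines):
    """Ideal strong scaling: fixed problem, time divides by machine count."""
    return t_base / n_machines

def weak_scaling_time(t_base, data_factor, machine_factor):
    """Ideal weak scaling: data and machines grow together, time stays flat."""
    return t_base * data_factor / machine_factor

halved = strong_scaling_time(1200.0, 2)   # twice the machines, half the time
flat = weak_scaling_time(1200.0, 2, 2)    # twice the data AND machines
```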
Bring computation to data
Map-Reduce: the statistical query model
f, the map function, is sent to every machine; the sum corresponds to a reduce operation.
D. Caragea et al., A Framework for Learning from Distributed Data Using Sufficient Statistics and Its Application to Learning Decision Trees. Int. J. Hybrid Intell. Syst., 2004.
Chu et al., Map-Reduce for Machine Learning on Multicore. NIPS'06.
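Because a least-squares gradient is a plain sum over examples, it fits the statistical query model directly: map f over each machine's shard, then reduce with addition. A toy single-process sketch (the shard contents and weights are made up):

```python
from functools import reduce

w = [0.5, -1.0]                       # current (made-up) weights

def f(example):
    """Map: per-example gradient of squared error under weights w."""
    x, target = example
    err = sum(wj * xj for wj, xj in zip(w, x)) - target
    return [err * xj for xj in x]

def add(g1, g2):
    """Reduce: elementwise sum of partial gradients."""
    return [a + b for a, b in zip(g1, g2)]

shards = [                            # data split across two "machines"
    [([1.0, 0.0], 1.0), ([0.0, 1.0], 2.0)],
    [([1.0, 1.0], 0.0)],
]
partials = [reduce(add, map(f, shard)) for shard in shards]  # per machine
grad = reduce(add, partials)                                 # final reduce
```

Only the small partial sums cross the network; the examples themselves never leave their machine, which is the whole point of bringing computation to the data.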
Spark on Criteo's data!
Logistic regression trained with minibatch SGD.
10GB of data (50MM lines). Caveat: quite small for a benchmark.
Super-linear strong scalability. Not theoretically possible => small dataset + few instances saturate.
[Chart: time in sec (0-1300) and number of cores (0-40) vs. number of AWS nodes (4-10).]
Manual setup of the cluster was a bit painful.
Software stack for big data
Cluster manager: Local, Standalone, YARN, MESOS
Storage layer: HDFS, Tachyon, Cassandra, HBase, others
Execution layer: Spark (memory-optimised execution engine), Flink (Apache-incubated execution engine), Hadoop MR 2
Libraries: Spark has MLlib, GraphX, Streaming, SQL/Dataframe; Flink has FlinkML, Gelly (graph), Table API, Batch
Software stack: MESOS vs YARN
Standalone mode is fastest, but resources are requested for the entire job. Cluster management frameworks allow concurrent access (multiuser) and hyperparameter tuning (multijob).
Mesos: frameworks receive offers; easy install on AWS, GCE; lots of compatible frameworks (Spark, MPI, Cassandra, HDFS); Mesosphere's DCOS is really, really easy to use.
YARN: frameworks make requests; configuration hell (can be made easier with puppet/ansible recipes); several compatible frameworks (Spark, Flink, HDFS).
Infrastructure stack: AWS = AWeSome
Basic instances with spot pricing: 10 r2.2xlarge instances (350GB mem. & 40 cores) for 0.85$/hour.
Graphical Network Designer.
Infrastructure stack
VPC, subnets (public/private), security rules, bootstrap config for master/slaves, network entry point.
Source: https://aws.amazon.com/architecture/
- Questions -
Thank you
BONUS: What's coming in the next few years?