Performance and scalability for machine learning. Arnaud Rachez ([email protected]) November 2nd, 2015

Transcript
  • Performance and scalability for machine learning

    Arnaud Rachez ([email protected])

    November 2nd, 2015

  • Outline

    Performance (7 min)

    Parallelism (7 min)

    Scalability (10 min)

  • Numbers everyone should know (2015 update)

    [Charts: memory-hierarchy throughput in GB/s. Left: L1, L2, L3 cache and RAM on a 0-800 GB/s scale. Right: RAM, SSD and network on a 0-30 GB/s scale, with bars annotated ~800MB, ~1.25GB and ~30GB.]

    Sources: http://www.eecs.berkeley.edu/~rcs/research/interactive_latency.html, http://forums.aida64.com/topic/2864-i7-5775c-l4-cache-performance/, http://www.macrumors.com/2015/05/21/15-inch-retina-macbook-pro-2gbps-throughput/

  • Outline

    Performance (5-7 min)

    Parallelism (5-7 min)

    Scalability (7-10 min)

  • Optimising SGD

    Linear-regression-like stochastic gradient descent with d=5 features and n=1,000,000 examples.

    Implemented in (1) Python, (2) Numba, (3) Numpy and (4) Cython (https://gist.github.com/zermelozf/3cd06c8b0ce28f4eeacd).

    Also compared to pure C++ code (https://gist.github.com/zermelozf/4df67d14f72f04b4338a).
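    The inner loop being benchmarked can be sketched as follows. This is a minimal pure-Python/NumPy illustration, not the code from the gists above; decorating the same function with Numba's @njit, or porting the loops to Cython, is what produces the compiled variants compared in the next slide.

```python
import numpy as np

def sgd_linreg(X, y, lr=0.01, epochs=3):
    """Plain-Python SGD loop for least-squares linear regression."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        for i in range(n):
            # Prediction error for one example.
            err = float(X[i] @ w - y[i])
            # Gradient step on each of the d weights.
            for j in range(d):
                w[j] -= lr * err * X[i, j]
    return w

# Small synthetic d=5 problem; SGD recovers the true weights.
rng = np.random.RandomState(0)
X = rng.randn(5000, 5)
true_w = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = X @ true_w
w = sgd_linreg(X, y)
```

    The explicit per-coordinate loop is exactly the shape of code where interpreter overhead dominates in pure Python and where compilation pays off most.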

  • Runtime optimisation

    [Chart: optimisation strategies (d=5 & n=1,000,000), time in ms on a log scale from 1 to 10,000, one bar each for Python, Numpy, Cython, Numba and C++. Annotations on the compiled variants ask about "memory views" and "pointers".]

    Source: https://gist.github.com/zermelozf/3cd06c8b0ce28f4eeacd

  • Runtime optimisation

    [Chart: cache optimisation (d=5 & n=1,000,000), time in ms from 0 to 160, bars for Numba, C++ and Cython under random versus linear access. The random-access bars are annotated "cache miss", the linear-access bars "cache hit".]
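    The random-versus-linear gap can be reproduced with a toy experiment: traverse the same array sequentially and in shuffled order. A sketch for illustration only; timings are machine-dependent, and in pure Python the interpreter overhead masks most of the cache effect that compiled Numba/C++/Cython code exposes:

```python
import time
import numpy as np

def traverse(a, order):
    # Visit every element of a in the given order and accumulate.
    total = 0.0
    for i in order:
        total += a[i]
    return total

n = 500_000
a = np.random.rand(n)
lin_idx = np.arange(n)               # sequential: prefetcher-friendly, cache hits
rnd_idx = np.random.permutation(n)   # shuffled: mostly cache misses

t0 = time.perf_counter(); s_lin = traverse(a, lin_idx); t_lin = time.perf_counter() - t0
t0 = time.perf_counter(); s_rnd = traverse(a, rnd_idx); t_rnd = time.perf_counter() - t0
# Both traversals compute the same sum; only the access pattern differs.
```

    The arithmetic is identical either way; any time difference comes purely from how the memory hierarchy serves the two access patterns.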

  • (d>>1) Gensim word2vec case study

    Elman-style RNN trained with SGD: a 15,079 x 200 matrix on a 1M-word corpus.

    Baseline written by Tomas Mikolov in optimised C.

    Rewritten by Radim Řehůřek in Python.

    Optimised by Radim Řehůřek using Cython and BLAS.

    [Chart: words/sec (x1000), 0 to 120, for Original C, Numpy, Cython, Cython + BLAS, and Cython + BLAS + sigmoid table; "pointers" annotations mark the Cython variants.]

    Source: http://rare-technologies.com/word2vec-in-python-part-two-optimizing/

  • What's this BLAS magic?

    Vectorised y = alpha*x (a BLAS saxpy call) replaced 3 lines of code and translated into a 3x speedup over Cython alone. Please read http://rare-technologies.com/word2vec-in-python-part-two-optimizing/

    "On my MacBook Pro, SciPy automatically links against Apple's vecLib, which contains an excellent BLAS. Similarly, Intel's MKL, AMD's ACML, Sun's SunPerf or the automatically tuned ATLAS are all good choices."

    Source: https://github.com/piskvorky/gensim/blob/develop/gensim/models/word2vec_inner.pyx
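    Gensim's kernel calls the single-precision BLAS saxpy through Cython pointers; at the NumPy level the same idea looks roughly like this. A sketch assuming SciPy's BLAS wrappers are available, not gensim's actual code:

```python
import numpy as np
from scipy.linalg.blas import daxpy  # double-precision axpy: y <- a*x + y

alpha = 0.5
x = np.arange(5, dtype=np.float64)
y0 = np.ones(5, dtype=np.float64)

# Hand-written loop: the kind of code the BLAS call replaced.
y_loop = y0.copy()
for i in range(len(x)):
    y_loop[i] += alpha * x[i]

# One tuned BLAS call performs the same axpy update, vectorised.
y_blas = daxpy(x, y0.copy(), a=alpha)
```

    The speedup comes from the BLAS routine being compiled, vectorised and tuned for the CPU, while the semantics are just the three-line loop above.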

  • Outline

    Performance (5-7 min)

    Parallelism (5-7 min)

    Scalability (7-10 min)

  • Hardware trends: CPU

    [Chart: clock speed (MHz, 0-4000) and number of cores (0-4), 1970-2015. Clock speed flattens out around the mid-2000s while core counts start to climb.]

    Source: http://www.gotw.ca/publications/concurrency-ddj.htm

  • (d>>1) Gensim word2vec continued

    Elman-style RNN trained with SGD: a 15,079 x 200 matrix on a 1M-word corpus.

    Baseline written by Tomas Mikolov in optimised C.

    Rewritten by Radim Řehůřek in Python.

    Optimised by Radim Řehůřek using Cython and BLAS, and parallelised with threads!

    [Chart: words/sec (x1000), 0 to 400, for 1 to 4 threads against the Original C and Cython + BLAS + sigmoid table baselines; up to a 2.85x speedup.]

    Source: http://rare-technologies.com/parallelizing-word2vec-in-python/

  • (d>>1) Hogwild! on SAG

    Fabian's experimentation with Julia (lang): running SAG in parallel, without a lock.

    Very nice speedup!
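    The Hogwild! idea, several workers applying gradient updates to shared weights with no locking, can be sketched with Python threads. This is an illustrative sketch using plain SGD rather than SAG, and CPython's GIL means the arithmetic does not truly run in parallel here; real lock-free experiments rely on compiled, GIL-free code such as the Julia setup above:

```python
import threading
import numpy as np

def hogwild_worker(w, X, y, lr, rows):
    # Updates the SHARED vector w with no lock. Racing, possibly stale
    # updates are tolerated -- the core observation behind Hogwild!.
    for i in rows:
        err = float(X[i] @ w - y[i])
        w -= lr * err * X[i]          # unsynchronised in-place update

rng = np.random.RandomState(0)
X = rng.randn(4000, 5)
true_w = np.array([1.0, -2.0, 3.0, 0.5, -1.0])
y = X @ true_w

w = np.zeros(5)                       # shared state, no lock anywhere
chunks = np.array_split(np.arange(4000), 4)
threads = [threading.Thread(target=hogwild_worker, args=(w, X, y, 0.02, c))
           for c in chunks]
for t in threads:
    t.start()
for t in threads:
    t.join()
# Despite the unsynchronised writes, w ends up close to true_w.
```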

  • Data does not fit in memory: stream it from disk

    ...but you cannot read in parallel.

    Producer/Consumer pattern: thread 1 (the producer) reads chunks 1, 2, 3, ... sequentially and turns them into jobs; consumer threads pick the jobs up, process them in parallel and mark them done. Et cetera.
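    The diagram above maps directly onto Python's thread-safe queue.Queue. A minimal sketch with one producer standing in for the disk reader, two consumers, and sentinel values to signal completion:

```python
import queue
import threading

N_CONSUMERS = 2
SENTINEL = None

def producer(q, chunks):
    for chunk in chunks:            # read chunks sequentially, as from disk
        q.put(chunk)
    for _ in range(N_CONSUMERS):    # one stop signal per consumer
        q.put(SENTINEL)

def consumer(q, out):
    while True:
        chunk = q.get()
        if chunk is SENTINEL:
            break
        out.append(sum(chunk))      # stand-in for the real processing job

chunks = [[1, 2], [3, 4], [5, 6], [7, 8], [9, 10], [11, 12]]
out = []                            # list.append is atomic under the GIL
q = queue.Queue(maxsize=3)          # bounded: producer blocks if consumers lag

workers = [threading.Thread(target=consumer, args=(q, out))
           for _ in range(N_CONSUMERS)]
for t in workers:
    t.start()
producer(q, chunks)                 # single sequential reader
for t in workers:
    t.join()
```

    The bounded queue is the important design choice: it applies back-pressure, so a fast producer cannot fill memory with chunks the consumers have not processed yet.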

  • How many consumers? It depends.

    Gensim (R. Řehůřek): saw the impact of up to 4 consumers earlier.

    Vowpal Wabbit (J. Langford): claims no gain with more than 1 consumer. 210 on my MacBook Pro for ~10GB and 50MM lines (Criteo's advertising dataset).

    CNN pre-processing (S. Dieleman): big impact with several consumers; useful for data augmentation/preprocessing.

  • 5.3GB (~105MM lines) word count

    [Chart: word-count Java benchmark, time from 0 to 220 versus number of consumers (1 to 6). MacBook Pro 15 2014, caches flushed with `sudo purge`.]

    Source: https://gist.github.com/nicomak/1d6561e6f71d936d3178

  • Outline

    Performance (5-7 min)

    Parallelism (5-7 min)

    Scalability (7-10 min)

  • Hardware trends: HDD

    [Chart: disk capacity (GB, 0-600) and time to read the full disk (sec, 0-4000), 1979-2011. Capacity has grown much faster than read throughput, so reading a full disk keeps taking longer.]

    Source: https://tylermuth.wordpress.com/2011/11/02

  • Distributed computing

    Scalability: a perspective on Big data.

    Strong scaling: if you throw twice as many machines at the task, you solve it in half the time. Usually relevant when the task is CPU bound.

    Weak scaling: if the dataset is twice as big, throw twice as many machines at it to solve the task in constant time. Usually relevant when the task is memory bound.

    Most big data problems are I/O bound. The hard part is solving the task in an acceptable time independently of the size of the data (weak scaling).
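    In idealised form the two regimes are simple arithmetic. A sketch with made-up numbers, ignoring communication overhead:

```python
def strong_scaling_time(t1, machines):
    # Strong scaling: fixed problem, m machines -> time divided by m.
    return t1 / machines

def weak_scaling_time(t1, machines, data_factor):
    # Weak scaling: data grows by data_factor and machines grow with it,
    # so the time stays flat when data_factor == machines.
    return t1 * data_factor / machines

# A job taking 1200 s on one machine:
t_strong = strong_scaling_time(1200, 4)   # same data, 4 machines
t_weak = weak_scaling_time(1200, 4, 4)    # 4x data, 4 machines
```

    Real systems fall short of both ideals once network and coordination costs enter, which is the point of the I/O-bound caveat above.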

  • Bring computation to data

    Map-Reduce: the statistical query model. f, the map function, is sent to every machine; the sum corresponds to a reduce operation.

    D. Caragea et al., A Framework for Learning from Distributed Data Using Sufficient Statistics and Its Application to Learning Decision Trees. Int. J. Hybrid Intell. Syst., 2004.

    Chu et al., Map-Reduce for Machine Learning on Multicore. NIPS 2006.
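    This works because the quantities a learner needs (gradients, counts, sufficient statistics) are sums over examples. A sketch: the gradient of the squared loss decomposes over data chunks, so each "machine" maps f over its local chunk and a reduce sums the partial results (illustrative NumPy code, not an actual Map-Reduce framework):

```python
from functools import reduce
import numpy as np

def local_gradient(w, X_chunk, y_chunk):
    # The map function f, evaluated where the data chunk lives:
    # gradient of 0.5*||X w - y||^2 restricted to this chunk.
    return X_chunk.T @ (X_chunk @ w - y_chunk)

rng = np.random.RandomState(0)
X = rng.randn(1200, 3)
y = X @ np.array([1.0, 2.0, 3.0])
w = np.zeros(3)

# Partition the rows across three "machines".
chunks = [(X[i:i + 400], y[i:i + 400]) for i in range(0, 1200, 400)]

partials = [local_gradient(w, Xc, yc) for Xc, yc in chunks]  # map
grad = reduce(np.add, partials)                              # reduce (a sum)
```

    Only the small partial gradients cross the network; the raw data never moves, which is the "bring computation to data" principle.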

  • Spark on Criteo's data

    Logistic regression trained with minibatch SGD on 10GB of data (50MM lines). Caveat: quite small for a benchmark.

    Super-linear strong scalability, which is not theoretically possible => the small dataset and few instances saturate.

    [Chart: time in sec (0-1300) and number of cores (0-40) versus number of AWS nodes (4 to 10).]

    Manual setup of the cluster was a bit painful.

  • Software stack for big data

    Cluster manager: Local, Standalone, YARN, MESOS.

    Storage layer: HDFS, Tachyon, Cassandra, HBase, others.

    Execution layer: Spark (memory-optimised execution engine), Flink (Apache-incubated execution engine), Hadoop MR 2.

    Libraries: MLlib, GraphX, Streaming, SQL/DataFrame on Spark; FlinkML, Gelly (graph), Table API, Batch on Flink.

  • Software stack: MESOS vs YARN

    Standalone mode is fastest, but resources are requested for the entire job.

    Cluster management frameworks help with concurrent access (multiuser) and hyperparameter tuning (multijob).

    Mesos: frameworks receive offers; easy install on AWS, GCE; lots of compatible frameworks (Spark, MPI, Cassandra, HDFS); Mesosphere's DCOS is really, really easy to use.

    YARN: frameworks make requests; configuration hell (can be made easier with puppet/ansible recipes); several compatible frameworks (Spark, Flink, HDFS).

  • Infrastructure stack: AWS = AWeSome

    Basic instance with spot price: 10 r2.2xlarge instances (350GB mem. & 40 cores) for 0.85$/hour.

    Graphical network designer.

  • Infrastructure stack

    VPC, subnets (public/private), security rules, bootstrap config for master/slaves, network entry point.

    Source: https://aws.amazon.com/architecture/

  • - Questions -

    Thank you

  • What's coming in the next few years?

    BONUS