Environment for training models Dmitry Spodarets AI Rush

Environment for training models

Jan 22, 2018

Page 1: Environment for training models

Environment for training models

Dmitry Spodarets

AI Rush

Page 2: Environment for training models

Who am I

Dmitry Spodarets

• Founder and CEO at FlyElephant

• PhD candidate at Odessa National University

• Lecturer at Odessa Polytechnic University

• Organizer of technical conferences about AI, BigData, HPC, JS, Web Technologies …

Page 3: Environment for training models

Agenda

• Data Science Tools Survey Results

• Computing resources

• Clouds (AWS & Azure)

• Containers (Docker, Singularity)

• FlyElephant platform for Data Science

Page 4: Environment for training models

Data Science Tools Survey

220 data scientists

Page 5: Environment for training models

Datasets

[Bar chart: survey responses by dataset size. Vertical axis: 0 to 70. Size bins: less than 1 MB, 1.1 to 10 MB, 11 to 100 MB, 101 MB to 1 GB, 1.1 to 10 GB, 11 to 100 GB, 101 GB to 1 TB, 1.1 to 10 TB, 11 to 100 TB, 101 TB to 1 PB, 1.1 to 10 PB, 11 to 100 PB, over 100 PB.]

Page 6: Environment for training models

Tools for collecting data

Python 45

R 26

Spark 18

SQL 15

Excel 13

Kafka 11

Pandas 10

custom 8

Hadoop 5

Numpy 5

SAS 5

Page 7: Environment for training models

Tools for storing data

PostgreSQL 37

CSV 31

MySQL 21

Hadoop 16

Excel 15

HDFS 15

MongoDB 15

MyServer 12

Oracle 11

Hive 8

Page 8: Environment for training models

Programming languages

Python 151

R 88

SQL 37

Java 32

Scala 22

bash 17

C++ 17

JavaScript 15

C# 13

vba 8

C 6
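
The counts above can be put in proportion to the 220 respondents from the survey slide. A minimal Python sketch, assuming each count is a number of respondents and that respondents could name several languages (the slides state neither); `share_of_respondents` is a hypothetical helper name:

```python
# Counts copied from the "Programming languages" slide.
RESPONDENTS = 220  # from the "Data Science Tools Survey" slide

language_counts = {
    "Python": 151, "R": 88, "SQL": 37, "Java": 32, "Scala": 22, "bash": 17,
    "C++": 17, "JavaScript": 15, "C#": 13, "vba": 8, "C": 6,
}


def share_of_respondents(count, total=RESPONDENTS):
    """Return the percentage of respondents, rounded to one decimal place."""
    return round(100.0 * count / total, 1)


# Assumption: every count is "respondents who named this language".
shares = {lang: share_of_respondents(n) for lang, n in language_counts.items()}

for lang, pct in shares.items():
    print(f"{lang:<12}{pct:5.1f}%")
```

The counts add up to 406, more than the 220 respondents, which is consistent with several answers per respondent.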

Page 9: Environment for training models

Libraries

Pandas 88

Numpy 68

scikit-learn 48

scipy 26

dplyr 20

matplotlib 20

ggplot2 15

keras 14

Spark 13

xgboost 13

TensorFlow 12

Page 10: Environment for training models

Tools for the visualization of data

matplotlib 66

ggplot 40

seaborn 33

Excel 22

Tableau 22

R 19

plotly 13

bokeh 12

d3 11

Page 11: Environment for training models

Clouds

aws 77

none 41

azure 25

google 24

digital ocean 9

OpenStack 7

Watson 1

Page 12: Environment for training models

The Jupyter Notebook

Page 13: Environment for training models

Jupyter Lab

Page 14: Environment for training models

Computing resources

Page 15: Environment for training models

Computing resources

Page 16: Environment for training models

Computing resources

NVIDIA DGX-1 Deep Learning Supercomputer: 170/3 TFLOPS (GPU FP16 / CPU FP32)

NVIDIA Tesla P100: ~5 TeraFLOPS

~3 TeraFLOPS

Page 17: Environment for training models

Image Training Performance on GoogLeNet (images trained per second)

Tesla K80: 1 GPU 251.77, 2 GPUs 425.38, 4 GPUs 569.1

Tesla P100: 1 GPU 467.73, 2 GPUs 791.96, 4 GPUs 1230.63

P100 speedup over K80: 1 GPU 1.86X, 2 GPUs 1.87X, 4 GPUs 2.2X

http://www.nvidia.com/object/caffe-benchmarks.html
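
The two ratios behind this chart, P100 over K80 and multi-GPU over single-GPU, can be recomputed from the bar labels. A minimal Python sketch, assuming the six labels read as K80 = 251.77, 425.38, 569.1 and P100 = 467.73, 791.96, 1230.63 images per second for 1, 2 and 4 GPUs (this assignment is inferred from the speedups printed on the axis, not stated on the slide); `speedup` is a hypothetical helper name:

```python
# Images trained per second on GoogLeNet, keyed by number of GPUs.
# Assumption: the bar labels belong to the two series as assigned below.
k80 = {1: 251.77, 2: 425.38, 4: 569.1}
p100 = {1: 467.73, 2: 791.96, 4: 1230.63}


def speedup(new, old):
    """Throughput ratio of one configuration over another."""
    return new / old


for gpus in (1, 2, 4):
    vs_k80 = speedup(p100[gpus], k80[gpus])  # P100 against K80, same GPU count
    scaling = speedup(p100[gpus], p100[1])   # P100 against a single P100
    efficiency = scaling / gpus              # 1.0 would be perfect linear scaling
    print(f"{gpus} GPU(s): P100/K80 = {vs_k80:.2f}x, "
          f"P100 scaling = {scaling:.2f}x, efficiency = {efficiency:.0%}")
```

Under this reading, four P100s deliver roughly 2.6 times the throughput of one rather than 4 times, which is the gap that effective parallelization has to close.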

Page 18: Environment for training models

1080 vs Titan X vs K80 vs P100 (TFLOPS)

1080: FP32 (single precision) 8.8, FP64 (double precision) 0.25

Titan X: FP32 10.1, FP64 0.3

K80: FP32 8.7, FP64 2.9

P100: FP32 10.6, FP64 5.3

http://www.nvidia.com/
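
One way to read this chart is the ratio of double- to single-precision throughput for each card. A minimal Python sketch, assuming the bar labels pair up with the GPUs as listed in the code (the pairing is inferred from the chart order, not spelled out on the slide); `fp64_fraction` is a hypothetical helper name:

```python
# Peak throughput in TFLOPS as (FP32, FP64).
# Assumption: each pair of chart labels belongs to the GPU named here.
peak_tflops = {
    "1080": (8.8, 0.25),
    "Titan X": (10.1, 0.3),
    "K80": (8.7, 2.9),
    "P100": (10.6, 5.3),
}


def fp64_fraction(fp32, fp64):
    """Double-precision throughput as a fraction of single-precision throughput."""
    return fp64 / fp32


for gpu, (fp32, fp64) in peak_tflops.items():
    ratio = fp64_fraction(fp32, fp64)
    print(f"{gpu:<8} FP32 {fp32:5.1f}  FP64 {fp64:5.2f}  FP64/FP32 {ratio:.2f}")
```

Under this reading the P100 runs double precision at half its single-precision rate and the K80 at a third, while the 1080 and Titan X stay around 3%.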

Page 19: Environment for training models

Problem

Effective parallelization of algorithms

Page 20: Environment for training models

NVIDIA Deep Learning SDK

Page 21: Environment for training models

Computing power (Intel)

• Intel Math Kernel Library (Intel MKL): natively supports C, C++ and Fortran development. Cross-language compatible with Java, C#, Python and other languages.

• Intel Data Analytics Acceleration Library (Intel DAAL): includes Python, C++, and Java APIs and connectors to popular data sources including Spark and Hadoop.

• Intel MPI Library: natively supports C, C++ and Fortran development.

Page 22: Environment for training models

Books

Page 23: Environment for training models

Clouds

Page 24: Environment for training models

Clouds

AWS (aws.amazon.com/marketplace/)

• P2-series: 16X K80

• X1-series: 128 vCPU / 1952 GB

• C4-series: 36 vCPU / 60 GB

Azure (azuremarketplace.microsoft.com)

• N-series: 4X K80

• H-Series: 16 vCPU / 224 GB

Page 25: Environment for training models

Azure CLI

1. sudo pip install azure-cli

2. az login

3. az group create --name GroupName --location EastUS

4. az vm create --resource-group GroupName --name MyVM --image Canonical:UbuntuServer:16.04-LTS:latest --size Standard_NC6 --storage-sku Standard_LRS --admin-username user --ssh-key-value ~/.ssh/id_rsa.pub

5. az vm deallocate --resource-group GroupName --name MyVM

6. az vm start --resource-group GroupName --name MyVM

7. az vm list-ip-addresses --resource-group GroupName --name MyVM

8. az vm delete --resource-group GroupName --name MyVM

9. az group delete --name GroupName
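
The same VM lifecycle can be scripted. A minimal Python sketch that only assembles the commands from this slide as argument lists and prints them; `vm_lifecycle_commands` and its parameters are hypothetical names, nothing is executed, and actually running the commands assumes an installed and logged-in Azure CLI:

```python
import os
import shlex


def vm_lifecycle_commands(group, vm, location="EastUS", size="Standard_NC6",
                          user="user", ssh_key="~/.ssh/id_rsa.pub"):
    """Build the az commands from the slide as argument lists (nothing is run)."""
    target = ["--resource-group", group, "--name", vm]
    # Expand "~" here because no shell will do it when an argument list is used.
    key_path = os.path.expanduser(ssh_key)
    return {
        "create_group": ["az", "group", "create",
                         "--name", group, "--location", location],
        "create_vm": ["az", "vm", "create", *target,
                      "--image", "Canonical:UbuntuServer:16.04-LTS:latest",
                      "--size", size, "--storage-sku", "Standard_LRS",
                      "--admin-username", user, "--ssh-key-value", key_path],
        "deallocate": ["az", "vm", "deallocate", *target],
        "start": ["az", "vm", "start", *target],
        "list_ips": ["az", "vm", "list-ip-addresses", *target],
        "delete_vm": ["az", "vm", "delete", *target],
        "delete_group": ["az", "group", "delete", "--name", group],
    }


commands = vm_lifecycle_commands("GroupName", "MyVM")
for step, argv in commands.items():
    print(f"{step}: {shlex.join(argv)}")
```

Each argument list could then be handed to `subprocess.run(argv, check=True)`; keeping the commands as lists avoids shell quoting problems with names that contain spaces.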

Page 26: Environment for training models

Data Science images in Azure Marketplace

Page 27: Environment for training models

Data Science images in AWS Marketplace

Page 28: Environment for training models

Containers

Page 29: Environment for training models

Docker

Page 30: Environment for training models

Docker (Dockerfile)

FROM gcr.io/tensorflow/tensorflow

MAINTAINER Dmitry Spodarets <[email protected]>

RUN apt update && apt -y upgrade && apt -y install git curl wget

CMD /run_jupyter.sh

Page 31: Environment for training models

Docker (build.sh)

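#!/bin/bash

# Build the image in ./<name>, push it to the registry, then remove the local copies.
function docker_build {
    docker build -t "$1" "./$1"
    docker tag "$1" "registry.flyelephant.net/$1"
    docker push "registry.flyelephant.net/$1"
    docker rmi "$1" "registry.flyelephant.net/$1"
}

case $1 in
    all)
        # Build every image listed in build.list.
        for i in $(cat build.list); do
            docker_build "$i"
        done
        ;;
    *)
        docker_build "$1"
        ;;
esac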

Page 32: Environment for training models

Docker Hub

https://hub.docker.com/

Page 33: Environment for training models

Docker

1. docker images

2. docker run --memory 512m --cpus="2" --name mycont registry.flyelephant.net/tensorflow

3. docker exec -i -t mycont bash

4. docker ps

5. docker stats

6. docker stop CONTAINER ID

7. docker start CONTAINER ID

8. docker rm CONTAINER ID

Page 34: Environment for training models

Docker Machine

• Amazon Web Services

• Digital Ocean

• Exoscale

• Generic

• Google Compute Engine

• IBM Softlayer

• Microsoft Azure

• Microsoft Hyper-V

• OpenStack

• Oracle

• VirtualBox

• Rackspace

• VMware Fusion

• VMware vCloud Air

• VMware vSphere

docker-machine create --driver azure --azure-subscription-id subscription-id --azure-resource-group resourcename --azure-ssh-user user --azure-size size machine-name

docker-machine ssh machine-name

Page 35: Environment for training models

Singularity

Page 36: Environment for training models

Singularity - Containers for Science

• First public release in April 2016, followed by a massive uptake

• HPCwire Editor’s Choice: Top Technologies to Watch for 2017

• Simple integration with resource managers, InfiniBand, GPUs, MPI, file systems, and supports multiple architectures (x86_64, PPC, ARM, etc.)

• Limits user’s privileges (inside user == outside user)

• No root-owned container daemon

• Network images are supported via URIs and all require local caching:

○ docker:// - This will pull a container from Docker Hub

○ http://, https:// - This will pull an image or tarball from the URL, cache and run it

○ shub:// - Pull an image from the Singularity Hub

Page 37: Environment for training models

Singularity - Usage Examples

$ python ./hello.py
Hello World: The Python version is 2.7.5
$ sudo singularity exec --writable /tmp/debian.img apt-get install python
…
$ singularity exec /tmp/debian.img python ./hello.py
Hello World: The Python version is 2.7.12

Webinar "Introduction to Singularity": https://youtu.be/h5rDnCA3NJA

Page 38: Environment for training models

Contributors to Singularity

Page 39: Environment for training models

Network Based Computing Lab, Ohio State University

• High-Performance Big Data (HiBD): http://hibd.cse.ohio-state.edu/

• High-Performance Deep Learning (HiDL): http://hidl.cse.ohio-state.edu/

Page 40: Environment for training models

FlyElephant

Page 41: Environment for training models

FlyElephant platform for Data Science

We automate Data Science and help teams to work efficiently.

Computing resources

Ready-computing infrastructure

Collaboration & Sharing

Fast Deployment

Expert Community

Page 42: Environment for training models

Ready-computing infrastructure

Jupyter or other IDE

Automatic running of tasks

Server or Cluster

Page 43: Environment for training models

Our resources

• Public Clouds: Azure & AWS.

• Private cloud based on OpenStack.

• HPC-clusters based on SLURM.

• Docker-clusters based on Swarm / Singularity.

• Tools and languages: R, Python, Java, Scala, C/C++, Julia, OpenFOAM, Octave, PyFR, Scilab, GROMACS, MATLAB, Intel MKL, FlowVision, ANSYS, COMSOL, AVL, Hadoop, Spark, H2O, Anaconda, scikit-learn, TensorFlow, Theano, Caffe, etc.

FlyElephant US 1 Cloud (P100, K80, Titan X, FPGA (Xilinx))

• HPC HUB 1: 80 nodes (2 × Xeon E5-2680v2 (20 cores), 64GB RAM, IB FDR) and 240TB storage.

• HPC HUB 2: 100 nodes (2 × Xeon E5-2670v2 (20 cores), 256GB RAM, IB FDR) and 240TB storage.

• HPC HUB 3: 150 nodes (2 × Xeon E5-2650v2 (16 cores), 128GB RAM, 2 × Tesla K80, IB FDR) and 240TB storage.

Advania, CESGA, TACC (17), HLRS (14), LANL (10)

Page 44: Environment for training models

Dmitry Spodarets

[email protected]