Environment for training models Dmitry Spodarets AI Rush
Who am I
Dmitry Spodarets
• Founder and CEO at FlyElephant
• PhD candidate at Odessa National University
• Lecturer at Odessa Polytechnic University
• Organizer of technical conferences about AI, BigData, HPC, JS, Web Technologies …
Agenda
• Data Science Tools Survey Results
• Computing resources
• Clouds (AWS & Azure)
• Containers (Docker, Singularity)
• FlyElephant platform for Data Science
Data Science Tools Survey
220 data scientists
Datasets
[Bar chart of dataset sizes; vertical axis from 0 to 70. Categories: less than 1 MB, 1.1 to 10 MB, 11 to 100 MB, 101 MB to 1 GB, 1.1 to 10 GB, 11 to 100 GB, 101 GB to 1 TB, 1.1 to 10 TB, 11 to 100 TB, 101 TB to 1 PB, 1.1 to 10 PB, 11 to 100 PB, over 100 PB]
Tools for collecting data
Python 45
R 26
Spark 18
SQL 15
Excel 13
Kafka 11
Pandas 10
custom 8
Hadoop 5
Numpy 5
SAS 5
Tools for storing data
PostgreSQL 37
CSV 31
MySQL 21
Hadoop 16
Excel 15
HDFS 15
Mongodb 15
MyServer 12
Oracle 11
Hive 8
Programming languages
Python 151
R 88
SQL 37
Java 32
Scala 22
bash 17
C++ 17
JavaScript 15
C# 13
vba 8
C 6
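The counts above can be read as shares of the survey sample. The following is a minimal sketch, assuming each count is the number of the 220 surveyed data scientists who named that language; the slides do not say whether multiple answers were allowed, so the shares need not sum to 100%. The `share` helper is hypothetical.

```python
# Counts copied from the "Programming languages" slide.
RESPONDENTS = 220  # assumption: every count is out of the 220 surveyed data scientists

language_counts = {
    "Python": 151, "R": 88, "SQL": 37, "Java": 32, "Scala": 22, "bash": 17,
    "C++": 17, "JavaScript": 15, "C#": 13, "vba": 8, "C": 6,
}


def share(count, total=RESPONDENTS):
    """Return count as a percentage of total, rounded to one decimal place."""
    return round(100.0 * count / total, 1)


language_shares = {name: share(n) for name, n in language_counts.items()}

if __name__ == "__main__":
    # Print languages from most to least frequently named.
    for name, pct in sorted(language_shares.items(), key=lambda kv: -kv[1]):
        print(f"{name:<12}{pct:5.1f}%")
```

Under that assumption, Python was named by roughly two thirds of respondents and R by two fifths.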
Libraries
Pandas 88
Numpy 68
scikit-learn 48
scipy 26
dplyr 20
matplotlib 20
ggplot2 15
keras 14
SPARK 13
xgboost 13
Tensorflow 12
Tools for the visualization of data
matplotlib 66
ggplot 40
seaborn 33
Excel 22
Tableau 22
R 19
plotly 13
bokeh 12
d3 11
Clouds
aws 77
none 41
azure 25
google 24
digital ocean 9
OpenStack 7
Watson 1
The Jupyter Notebook
Jupyter Lab
Computing resources
NVIDIA DGX-1 Deep Learning Supercomputer: 170/3 TFLOPS (GPU FP16 / CPU FP32)
NVIDIA Tesla P100: ~5 TeraFLOPS
~3 TeraFLOPS
Image Training Performance on GoogLeNet
Images trained per second (P100 speedup over K80 in parentheses):

GPUs             Tesla K80    Tesla P100
1 GPU (1.86X)    251.77       467.73
2 GPUs (1.87X)   425.38       791.96
4 GPUs (2.2X)    569.1        1230.63

http://www.nvidia.com/object/caffe-benchmarks.html
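The speedup labels on the GoogLeNet chart can be re-derived from the throughput values, and the same values show how far multi-GPU training falls short of linear scaling. This is a minimal sketch, assuming the chart values pair up as K80: 251.77 / 425.38 / 569.1 and P100: 467.73 / 791.96 / 1230.63 images per second for 1, 2 and 4 GPUs; the function names are hypothetical.

```python
# Throughput in images per second, as read from the chart (assumed pairing).
throughput = {
    "Tesla K80": {1: 251.77, 2: 425.38, 4: 569.1},
    "Tesla P100": {1: 467.73, 2: 791.96, 4: 1230.63},
}


def p100_speedup(gpus):
    """Ratio of P100 to K80 throughput at the same GPU count."""
    return throughput["Tesla P100"][gpus] / throughput["Tesla K80"][gpus]


def scaling_efficiency(card, gpus):
    """Throughput on `gpus` GPUs relative to perfect linear scaling from 1 GPU."""
    per_count = throughput[card]
    return per_count[gpus] / (gpus * per_count[1])


if __name__ == "__main__":
    for gpus in (1, 2, 4):
        print(f"{gpus} GPU(s): P100 is {p100_speedup(gpus):.2f}x the K80 throughput")
    for card in throughput:
        for gpus in (2, 4):
            eff = scaling_efficiency(card, gpus)
            print(f"{card}, {gpus} GPUs: {eff:.0%} of linear scaling")
```

With these readings, four GPUs reach only about 57% (K80) and 66% (P100) of ideal linear scaling, which is why effective parallelization matters as much as raw hardware.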
1080 vs Titan X vs K80 vs P100
Performance in TFLOPS:

GPU        FP32 (single precision)    FP64 (double precision)
1080       8.8                        0.25
Titan X    10.1                       0.3
K80        8.7                        2.9
P100       10.6                       5.3

http://www.nvidia.com/
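One way to read the FP32 versus FP64 comparison is as a ratio per card. The sketch below assumes the chart values pair up as 1080: 8.8 / 0.25, Titan X: 10.1 / 0.3, K80: 8.7 / 2.9 and P100: 10.6 / 5.3 TFLOPS (FP32 / FP64); `fp64_ratio` is a hypothetical helper.

```python
# TFLOPS figures as read from the chart (assumed pairing per card).
tflops = {
    "1080": {"fp32": 8.8, "fp64": 0.25},
    "Titan X": {"fp32": 10.1, "fp64": 0.3},
    "K80": {"fp32": 8.7, "fp64": 2.9},
    "P100": {"fp32": 10.6, "fp64": 5.3},
}


def fp64_ratio(card):
    """Double-precision throughput as a fraction of single-precision throughput."""
    return tflops[card]["fp64"] / tflops[card]["fp32"]


if __name__ == "__main__":
    for card in tflops:
        print(f"{card:<8} FP64/FP32 = {fp64_ratio(card):.3f}")
```

Under these readings the 1080 and Titan X deliver about 3% of their single-precision rate in double precision, the K80 about one third, and the P100 one half, so the choice of card depends on whether the workload needs FP64.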
Problem
Effective parallelization of algorithms
NVIDIA Deep Learning SDK
Computing power (Intel)
• Intel Math Kernel Library (Intel MKL): natively supports C, C++ and Fortran development. Cross-language compatible with Java, C#, Python and other languages.
• Intel Data Analytics Acceleration Library (Intel DAAL): includes Python, C++, and Java APIs and connectors to popular data sources including Spark and Hadoop.
• Intel MPI Library: natively supports C, C++ and Fortran development.
Books
Clouds
AWS                              Azure
P2-series: 16X K80               N-series: 4X K80
X1-series: 128 vCPU / 1952 GB    H-Series: 16 vCPU / 224 GB
C4-series: 36 vCPU / 60 GB
aws.amazon.com/marketplace/      azuremarketplace.microsoft.com
Azure CLI
1. sudo pip install azure-cli
2. az login
3. az group create --name GroupName --location EastUS
4. az vm create --resource-group GroupName --name MyVM --image Canonical:UbuntuServer:16.04-LTS:latest --size Standard_NC6 --storage-sku Standard_LRS --admin-username user --ssh-key-value ~/.ssh/id_rsa.pub
5. az vm deallocate --resource-group GroupName --name MyVM
6. az vm start --resource-group GroupName --name MyVM
7. az vm list-ip-addresses --resource-group GroupName --name MyVM
8. az vm delete --resource-group GroupName --name MyVM
9. az group delete --name GroupName
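When the VM lifecycle above has to be repeated for several machines, the commands can be generated from parameters. The following is a minimal dry-run sketch that only builds and prints the argument lists for steps 3 to 9; it uses exactly the subcommands and flags shown on the slide, while the function names `az_vm_create_args` and `az_vm_lifecycle` are hypothetical. Actually running the commands would require the Azure CLI to be installed and `az login` to have been completed.

```python
import os
import shlex


def az_vm_create_args(group, name, size="Standard_NC6",
                      image="Canonical:UbuntuServer:16.04-LTS:latest",
                      admin="user", ssh_key="~/.ssh/id_rsa.pub"):
    """Build the argument list for step 4 (az vm create) from the slide."""
    return [
        "az", "vm", "create",
        "--resource-group", group,
        "--name", name,
        "--image", image,
        "--size", size,
        "--storage-sku", "Standard_LRS",
        "--admin-username", admin,
        # Expand "~" here because no shell will do it for an argument list.
        "--ssh-key-value", os.path.expanduser(ssh_key),
    ]


def az_vm_lifecycle(group, name):
    """Return steps 3 to 9 as argument lists, in slide order."""
    vm = ["--resource-group", group, "--name", name]
    return [
        ["az", "group", "create", "--name", group, "--location", "EastUS"],
        az_vm_create_args(group, name),
        ["az", "vm", "deallocate"] + vm,
        ["az", "vm", "start"] + vm,
        ["az", "vm", "list-ip-addresses"] + vm,
        ["az", "vm", "delete"] + vm,
        ["az", "group", "delete", "--name", group],
    ]


if __name__ == "__main__":
    # Dry run: print each command instead of executing it.
    for args in az_vm_lifecycle("GroupName", "MyVM"):
        print(shlex.join(args))
```

Each list could be passed to `subprocess.run(args, check=True)` to execute it; keeping the commands as lists avoids shell quoting problems with group or VM names.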
Data Science images in Azure Marketplace
Data Science images in AWS Marketplace
Containers
Docker
Docker (Dockerfile)
FROM gcr.io/tensorflow/tensorflow
MAINTAINER Dmitry Spodarets <[email protected]>
RUN apt update && apt -y upgrade && apt -y install git curl wget
CMD /run_jupyter.sh
Docker (build.sh)
#!/bin/bash
# Build one image, push it to the registry, then remove the local copies.
function docker_build {
    docker build -t "$1" "./$1"
    docker tag "$1" "registry.flyelephant.net/$1"
    docker push "registry.flyelephant.net/$1"
    docker rmi "$1" "registry.flyelephant.net/$1"
}

case "$1" in
    all)
        # Build every image named in build.list.
        for i in $(cat build.list); do
            docker_build "$i"
        done
        ;;
    *)
        docker_build "$1"
        ;;
esac
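The logic of build.sh can be checked without a Docker daemon by generating the command sequence first. This is a minimal sketch under that assumption: it mirrors the build, tag, push, rmi order and the `all` case of the script, but only returns argument lists; `docker_build_commands` and `plan` are hypothetical names.

```python
REGISTRY = "registry.flyelephant.net"  # registry host taken from the slide


def docker_build_commands(image, registry=REGISTRY):
    """Return the four docker commands the script runs for one image."""
    remote = f"{registry}/{image}"
    return [
        ["docker", "build", "-t", image, f"./{image}"],
        ["docker", "tag", image, remote],
        ["docker", "push", remote],
        ["docker", "rmi", image, remote],
    ]


def plan(target, build_list):
    """Mirror the case statement: 'all' expands to every image in build_list."""
    images = build_list if target == "all" else [target]
    return [cmd for image in images for cmd in docker_build_commands(image)]


if __name__ == "__main__":
    # Dry run for a single image; the image name is only an example.
    for cmd in plan("tensorflow", []):
        print(" ".join(cmd))
```

Printing the plan before running it makes it easy to verify which images would be pushed and removed.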
Docker Hub
https://hub.docker.com/
Docker
1. docker images
2. docker run --memory 512m --cpus="2" --name mycont registry.flyelephant.net/tensorflow
3. docker exec -i -t mycont bash
4. docker ps
5. docker stats
6. docker stop CONTAINER ID
7. docker start CONTAINER ID
8. docker rm CONTAINER ID
Docker Machine
• Amazon Web Services
• Digital Ocean
• Exoscale
• Generic
• Google Compute Engine
• IBM Softlayer
• Microsoft Azure
• Microsoft Hyper-V
• OpenStack
• Oracle
• VirtualBox
• Rackspace
• VMware Fusion
• VMware vCloud Air
• VMware vSphere
docker-machine create --driver azure --azure-subscription-id subscription-id --azure-resource-group resourcename --azure-ssh-user user --azure-size vm-size machine-name
docker-machine ssh machine-name
Singularity
Singularity - Containers for Science
• First public release in April 2016, followed by a massive uptake
• HPC Wire Editor’s choice: Top Technologies to Watch for 2017
• Simple integration with resource managers, InfiniBand, GPUs, MPI, file systems, and supports multiple architectures (x86_64, PPC, ARM, etc.)
• Limits user’s privileges (inside user == outside user)
• No root owned container daemon
• Network images are supported via URIs and all require local caching:
○ docker:// - This will pull a container from Docker Hub
○ http://, https:// - This will pull an image or tarball from the URL, cache and run it
○ shub:// - Pull an image from the Singularity Hub
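The URI schemes listed above amount to a small dispatch table. The sketch below illustrates that mapping in Python; it is not Singularity's own code, `describe_image_source` is a hypothetical helper, and treating anything without a listed scheme as a local image file is an assumption.

```python
from urllib.parse import urlsplit

# Scheme-to-behaviour mapping, paraphrased from the list above.
ACTIONS = {
    "docker": "pull a container from Docker Hub",
    "http": "pull an image or tarball from the URL, cache and run it",
    "https": "pull an image or tarball from the URL, cache and run it",
    "shub": "pull an image from the Singularity Hub",
}


def describe_image_source(uri):
    """Describe how an image reference would be fetched, based on its scheme."""
    scheme = urlsplit(uri).scheme.lower()
    # Assumption: a reference without one of the listed schemes is a local file.
    return ACTIONS.get(scheme, "use a local image file")


if __name__ == "__main__":
    # The docker:// and shub:// references are illustrative, not from the slides.
    for ref in ("docker://ubuntu", "shub://example/image", "/tmp/debian.img"):
        print(f"{ref}: {describe_image_source(ref)}")
```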
Singularity - Usage Examples
$ python ./hello.py
Hello World: The Python version is 2.7.5
$ sudo singularity exec --writable /tmp/debian.img apt-get install python
…
$ singularity exec /tmp/debian.img python ./hello.py
Hello World: The Python version is 2.7.12
Webinar "Introduction to Singularity": https://youtu.be/h5rDnCA3NJA
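The slides show only the output of hello.py, not its source. A script along the following lines would produce that output; this is an assumed reconstruction, not the original file, and `hello_message` is a hypothetical name.

```python
import platform


def hello_message():
    """Build the greeting with the version of the interpreter that runs it."""
    return "Hello World: The Python version is " + platform.python_version()


if __name__ == "__main__":
    print(hello_message())
```

Run on the host and then inside the container, such a script reports two different interpreter versions, which is the point of the example: the container carries its own software stack.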
Contributors to Singularity
Network Based Computing Lab, Ohio State University
• High-Performance Big Data (HiBD): http://hibd.cse.ohio-state.edu/
• High-Performance Deep Learning (HiDL): http://hidl.cse.ohio-state.edu/
FlyElephant
FlyElephant platform for Data Science
We automate Data Science and help teams to work efficiently.
Computing resources
Ready-computing infrastructure
Collaboration & Sharing
Fast Deployment
Expert Community
Ready-computing infrastructure
Jupyter or other IDE
Automatic running of tasks
Server or Cluster
Our resources
• Public Clouds: Azure & AWS.
• Private cloud based on OpenStack.
• HPC-clusters based on SLURM.
• Docker-clusters based on Swarm / Singularity.
• Tools and languages: R, Python, Java, Scala, C/C++, Julia, OpenFOAM, Octave, PyFR, Scilab, GROMACS, MATLAB, Intel MKL, FlowVision, ANSYS, COMSOL, AVL, Hadoop, Spark, H2O, Anaconda, scikit-learn, Tensorflow, Theano, Caffe, etc.
FlyElephant US 1 Cloud (P100, K80, Titan X, FPGA (Xilinx))
• HPC HUB 1: 80 nodes (2 × Xeon E5-2680v2 (20 cores), 64GB RAM, IB FDR) and 240TB storage.
• HPC HUB 2: 100 nodes (2 × Xeon E5-2670v2 (20 cores), 256GB RAM, IB FDR) and 240TB storage.
• HPC HUB 3: 150 nodes (2 × Xeon E5-2650v2 (16 cores), 128GB RAM, 2 × Tesla K80, IB FDR) and 240TB storage.
Advania, CESGA, TACC (17), HLRS (14), LANL (10)
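The per-node figures for the three HPC HUBs can be turned into cluster totals with simple arithmetic. This is a minimal sketch, assuming the core count in parentheses is per node (both CPUs combined) and the RAM figure is per node; the `totals` helper is hypothetical.

```python
# (nodes, cores per node, RAM per node in GB), copied from the HPC HUB bullets.
hubs = {
    "HPC HUB 1": (80, 20, 64),
    "HPC HUB 2": (100, 20, 256),
    "HPC HUB 3": (150, 16, 128),
}


def totals(nodes, cores_per_node, ram_gb_per_node):
    """Return (total cores, total RAM in GB) for one cluster."""
    return nodes * cores_per_node, nodes * ram_gb_per_node


total_cores = sum(totals(*spec)[0] for spec in hubs.values())
total_ram_gb = sum(totals(*spec)[1] for spec in hubs.values())
hub3_k80_cards = hubs["HPC HUB 3"][0] * 2  # 2 x Tesla K80 per node

if __name__ == "__main__":
    for name, spec in hubs.items():
        cores, ram = totals(*spec)
        print(f"{name}: {cores} cores, {ram} GB RAM")
    print(f"All hubs: {total_cores} cores, {total_ram_gb} GB RAM, "
          f"{hub3_k80_cards} K80 cards in HUB 3")
```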