Page 1
© 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Tom 'Elvis' Jones, Partner Solutions Architect, Amazon Web Services
Diego Oppenheimer, CEO, Algorithmia
December 1, 2016
Bringing Deep Learning to the Cloud
with Amazon EC2
CMP314
Page 2
AWS Pace of Innovation
[Chart: new features/services launched per year. 2009: 48, 2011: 82, 2013: 280, 2015: 722]
AWS has been continually expanding its services to support virtually any cloud workload,
and now has more than 70 services spanning compute, storage, networking, database,
analytics, application services, deployment, management, and mobile. AWS has launched
a total of 706 new features and/or services year to date*, for a total of 2,601 new
features and/or services since inception in 2006.
* As of 1 October 2016
Page 3
Microservices
Serverless
Lambda
Pay as you go
Scalability
Page 4
Machine learning
Machine learning is the technology that
automatically finds patterns in your data and
uses them to make predictions for new data
points as they become available
Your data + machine learning = smart applications
Page 5
New P2 GPU Instance Types
• New EC2 GPU instance type for accelerated computing
• Offers up to 16 NVIDIA K80 GPUs (8 K80 cards) in a single instance
• The 16xlarge size provides:
• A combined 192 GB of GPU memory, 40 thousand CUDA cores
• 70 teraflops of single precision floating point performance
• Over 23 teraflops of double precision floating point performance
• Example workloads include:
• Deep learning, computational fluid dynamics, computational finance,
seismic analysis, molecular modeling, genomics, VR content rendering,
accelerated databases
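As a sanity check, the per-GPU figures can be derived from the 16xlarge aggregates. A quick sketch (the per-GPU results are approximate; a K80 card carries two GK210 GPUs with 12 GB and 2,496 CUDA cores each):

```python
# Back-of-the-envelope: per-GPU resources on a p2.16xlarge (16 GPUs total).
# Aggregate figures come from the slide above.
total_gpus = 16
total_gpu_mem_gb = 192
total_cuda_cores = 40_000

mem_per_gpu_gb = total_gpu_mem_gb / total_gpus   # 12 GB per GPU
cores_per_gpu = total_cuda_cores / total_gpus    # ~2,500 cores, close to the K80's 2,496

print(mem_per_gpu_gb, cores_per_gpu)
```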
Page 6
P2 Instance Types
Three P2 instance sizes:

Instance size   GPUs   GPU peer-to-peer   vCPUs   Memory (GiB)   Network bandwidth*
p2.xlarge       1      -                  4       61             1.25 Gbps
p2.8xlarge      8      Y                  32      488            10 Gbps
p2.16xlarge     16     Y                  64      732            20 Gbps

*In a placement group
Page 7
P2 Instance Summary
High-performance GPU instance types, with many innovations
• Based on NVIDIA K80 GPUs, with up to 16 GPUs in a single instance
• With dedicated peer-to-peer connections supporting GPUDirect
• Intel Broadwell processors with up to 64 vCPUs and up to 732GiB RAM
• 20Gbps network on EC2, using Elastic Network Adaptor (ENA)
• Supporting a wide variety of ISV applications and open-source
frameworks
Page 9
Diego Oppenheimer – CEO, Algorithmia
Product developer, entrepreneur, extensive background
in all things data.
Microsoft: PowerPivot, PowerBI, Excel, and SQL Server
Founder of algorithmic trading startup
BS/MS Carnegie Mellon University
Page 10
Make state-of-the-art
algorithms accessible and
discoverable by everyone.
Page 11
A marketplace for algorithms...
We host algorithms
Anyone can turn their algorithms into scalable web services
Typical users: scientists, academics, domain experts
We make them discoverable
Anyone can use and integrate these algorithms into their solutions
Typical users: businesses, data scientists, app developers, IoT makers
We make them monetizable
Users of algorithms pay for algorithms they use
Typical scenarios: heavy-load use cases with large user base
Page 12
Sample algorithms (2600+ and growing daily)
● Text analysis: summarizer, sentence tagger, profanity detection
● Machine learning: digit recognizer, recommendation engines
● Web: crawler, scraper, pagerank, emailer, html to text
● Computer vision: image similarity, face detection, smile detection
● Audio & video: speech recognition, sound filters, file conversions
● Computation: linear regression, spike detection, Fourier filter
● Graph: traveling salesman, maze generator, theta star
● Utilities: parallel for-each, geographic distance, email validator
Page 13
Machine Intelligence Stack
Applications
Scalable CPU/GPU compute
Algorithms as micro services
Data stores
Page 14
Scale?
• Support the workloads of 32,000 developers
• Mixed use CPU/GPU
• Spikey traffic
• Heterogeneous hardware
Page 15
Hosting Deep Learning
Cloud hosting of deep learning models can be especially challenging
due to complex hardware and software dependencies
At Algorithmia we had to:
• Learn how to host and scale the 5 most common deep learning
frameworks (more coming).
• Deal with scaling and spikey traffic on GPUs
• Deal with multitenancy on GPUs
• Build an extensible and dynamic architecture to support deep
learning
Page 16
First: What is Deep Learning?
• Deep learning uses artificial neurons, loosely analogous to those in the brain, to
represent high-dimensional data
• The mammalian brain is organized in a deep architecture; e.g., the visual
system has 5 to 10 levels [1]
• Deep learning excels in tasks where the basic unit (a single pixel,
frequency, or word) has very little meaning in and of itself, but the data
contains high-level structure. Deep nets have been effective at
learning this structure without human intervention.
[1] Serre, Kreiman, Kouh, Cadieu, Knoblich, & Poggio,
Page 17
What is Deep Learning Being Applied To?
Primarily: huge growth in unstructured data
• Pictures
• Videos
• Audio
• Speech
• Websites
• Emails
• Reviews
• Log files
• Social media
Page 18
Today’s Use Cases
• Computer vision
  • Image classification
  • Object detection
  • Face recognition
• Natural language
  • Speech to text
  • Chatbots
  • Q&A systems (Siri, Alexa, Google Now)
  • Machine translation
• Optimization
  • Anomaly detection
  • Recommender systems
Page 19
Why Now?
…and why is deep learning suddenly everywhere?
Advances in research
• LeCun, Gradient-Based Learning Applied to
Document Recognition,1998
• Hinton, A Fast Learning Algorithm for Deep
Belief Nets, 2006
• Bengio, Learning Deep Architectures for AI,
2009
Advances in hardware
• GPUs: 10x performance, 5x energy efficiency
http://www.nvidia.com/content/events/geoInt2015/LBrown_DL_Image_ClassificationGEOINT.pdf
Page 20
Deep Learning Hardware (2016)
GPUs: NVIDIA is dominating
One of the first GPU neural nets was trained on an NVIDIA
GTX 280: a neural network of up to 9 layers (Ciresan
and Schmidhuber, 2010)
• NVIDIA chips tend to outperform AMD
• More importantly, all the major frameworks use
CUDA as a first-class citizen. Poor support for
AMD’s OpenCL.
Page 21
Deep Learning Hardware
GPU:
• Becoming more tailored for deep learning (e.g., Pascal chipset)
Custom hardware:
• FPGA (AWS F1, MSFT Project Catapult)
ASIC:
• Google TPU
• IBM TrueNorth
• Nervana Engine
• Graphcore IPUs
Page 22
GPU Deep Learning Dependencies
Meta deep learning framework
Deep learning framework
cuDNN
CUDA
NVIDIA driver
GPU
Page 23
Deep Learning Frameworks
Page 24
Theano
Created by Université de Montréal
Theano pioneered the trend of using a symbolic graph for programming a network. A very mature framework with good support for many kinds of networks.

Pros:
• Uses Python + NumPy
• Declarative computational graph
• Good support for RNNs
• Wrapper frameworks make it more accessible (Keras, Lasagne, Blocks)
• BSD License

Cons:
• Low-level framework
• Error messages can be unhelpful
• Large models can have long compile times
• Weak support for pre-trained models
Page 25
Torch
Created by a collaboration of various researchers. Used by DeepMind (prior to Google). Torch is a general scientific computing framework for Lua.

Torch is more flexible than TensorFlow and Theano in that it's imperative, while TF/Theano are declarative. That makes some operations (e.g., beam search) much easier to do.

Pros:
• Very flexible multidimensional array engine
• Multiple back ends (CUDA and OpenMP)
• Lots of pre-trained models available

Cons:
• Lua
• Not good for recurrent networks
• Lack of commercial support
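The imperative-vs-declarative distinction can be sketched without either framework. A toy example in plain Python (the `Node` graph class is invented for illustration; it only mimics the Theano/TensorFlow build-then-run pattern):

```python
# Imperative style (Torch-like): each operation executes as it is written.
imperative_result = 3 * 3 + 4 * 4

# Declarative style (Theano/TensorFlow-like): first build a computational
# graph, then evaluate it. Nothing runs until eval() is called.
class Node:
    """A graph node: an operation plus the nodes feeding into it."""
    def __init__(self, fn, *inputs):
        self.fn, self.inputs = fn, inputs

    def eval(self):
        # Recursively evaluate inputs, then apply this node's operation.
        return self.fn(*(n.eval() for n in self.inputs))

def const(v):
    return Node(lambda: v)

def add(a, b):
    return Node(lambda x, y: x + y, a, b)

def mul(a, b):
    return Node(lambda x, y: x * y, a, b)

# Build the graph for x*x + y*y first, evaluate later.
x, y = const(3), const(4)
graph = add(mul(x, x), mul(y, y))
declarative_result = graph.eval()

print(imperative_result, declarative_result)  # 25 25
```

A real declarative framework exploits the explicit graph for optimization and automatic differentiation, which is exactly what makes imperative tricks like beam search harder to express there.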
Page 26
Caffe
Created by the Berkeley Vision and Learning Center and community contributors.
Probably the most used framework today, certainly for CV.
Pros:
• Optimized for feedforward networks, convolutional nets and image processing.
• Simple Python API
• BSD License
Cons:
• C++/CUDA for new GPU layers
• Limited support for recurrent networks (recently added)
• Cumbersome for big networks (GoogLeNet, ResNet)
Page 27
TensorFlow
Created by Google. TensorFlow is written with a Python API over a C/C++ engine. TensorFlow generates a computational graph (e.g., a series of matrix operations) and performs automatic differentiation.

Pros:
• Uses Python + NumPy
• Lots of interest from the community
• Highly parallel, and designed to use various back ends (software, GPU, ASIC)
• Apache License

Cons:
• Slower than other frameworks [1]
• More features, more abstractions than Torch
• Not many pre-trained models yet

[1] https://arxiv.org/pdf/1511.06435v3.pdf
Page 28
Networks for Training
Where to get networks:
• If you’re just interested in using deep learning to classify images, you can
usually find off-the-shelf networks.
• VGG, GoogLeNet, AlexNet, SqueezeNet
• Caffe Model Zoo
Page 29
Training vs. Running
Deep learning generally consists of two phases: training and running.
Training deep learning models is challenging, with many solutions available
today.
Running deep learning models (at scale) is the next step, and has its own
challenges.
Page 30
Hosting Deep Learning
Making deep learning models available as an API represents a
unique set of challenges that are rarely, if ever, addressed in
tutorials.
Page 31
Why ML in the Cloud?
• Need to react to live user data
• Don’t want to manage own servers
• Need enough servers to sustain max load; elastic cloud capacity can
save money here
• Limited compute capacity on mobile
Page 32
http://blog.algorithmia.com/2016/07/cloud-hosted-deep-learning-models
Use case: http://colorize-it.com
Page 33
[Chart: traffic vs. provisioned capacity over time]
Capacity required at peak
Potential for ghost/wasted capacity
Without elastic compute on GPUs our cost would be ~75% more
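The ghost-capacity point can be made concrete with a small sketch. The traffic numbers below are invented for illustration (the ~75% figure on the slide refers to Algorithmia's actual workload):

```python
# Hourly GPU demand over a day (invented spikey traffic, in instances).
demand = [2, 2, 2, 3, 4, 8, 16, 16, 12, 6, 4, 2] * 2  # 24 hours

peak = max(demand)
# Fixed fleet: pay for peak capacity around the clock.
fixed_instance_hours = peak * len(demand)
# Elastic fleet: pay only for what each hour actually needs.
elastic_instance_hours = sum(demand)

ghost_hours = fixed_instance_hours - elastic_instance_hours
savings = ghost_hours / fixed_instance_hours
print(fixed_instance_hours, elastic_instance_hours, round(savings, 2))
```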
Page 34
Service Oriented Architecture
Going to want a dedicated infrastructure for handling
computationally intensive tasks like deep learning
Page 35
[Architecture diagram]
CLIENTS -> LOAD BALANCERS -> API SERVERS ->
  CPU WORKER #1 ... CPU WORKER #N: Docker(algorithm#1), Docker(algorithm#2), ..., Docker(algorithm#n)
  GPU WORKER #1 ... GPU WORKER #N: Docker(deep-algo#1), Docker(deep-algo#2), ..., Docker(deep-algo#n)
(Instance types shown: m4, x1)
Page 36
Why P2s?
• More video memory
• 12 GB per GPU
• Modern CUDA support
• More CUDA cores to run in parallel
• New messages
• In particular, we had problems with CUDA 3.0 not allowing us to
share memory as efficiently
• Price per flop
Page 37
Customer Showcase:
• CSDisco offers cutting-edge eDiscovery technology for attorneys: "DISCO ML".
• DISCO ML is a deep learning based exploration tool that asynchronously re-learns as
attorneys move through their normal discovery workflows.
• “The proprietary, multi-layer artificial network uses deep learning and arrays of GPUs
to unpack and process learned information quickly. Combining Google’s advanced
Word2Vec embeddings with powerful convolutional and recurrent neural networks for
text...”
Page 38
Customer Showcase:
Why did they choose to host their ML on Algorithmia?
• Scalability: Required a scalable GPU-based compute fabric for their neural-net-based
ML approach, to on-board hundreds of new customers without taxing their engineering
department.
• Flexibility: Expected high peaks of usage during certain hours.
• Reduced ghost compute: excess capacity = unnecessary cost
• Ability to chain algorithms: The process is a series of operations (scoring, training,
validation, querying), each its own model hosted on Algorithmia. Algorithmia
provides an easy way to pipe algorithms into each other.
Page 39
Challenges and how EC2 helps (with some)
Page 40
Challenge #1: New Hardware, Latest CUDA
AWS: G2 (GRID K520, 2013) and P2 instances
Azure: N-Series
Google: Just announced preview
SoftLayer: various cards including Tesla K80 and M60.
Small providers: Nimbix, Cirrascale, Penguin
Page 41
Challenge #2: Language Bindings
You probably already have an existing stack in some programming
language. How does it talk to your deep learning framework?
Hope you like Python (or Lua)
Solution: Services!
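The service pattern can be sketched with only the Python standard library: put the model behind an HTTP endpoint, and any language that speaks HTTP can call it. The `predict` stub below stands in for a loaded network; the endpoint name and JSON shape are invented for illustration:

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import Request, urlopen

def predict(features):
    """Stub model: a real service would run a loaded network here."""
    return {"score": sum(features) / max(len(features), 1)}

class PredictHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        body = self.rfile.read(int(self.headers["Content-Length"]))
        result = predict(json.loads(body)["features"])
        payload = json.dumps(result).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)

    def log_message(self, *args):  # silence per-request logging
        pass

# Serve the model on a background thread, on an ephemeral port.
server = HTTPServer(("127.0.0.1", 0), PredictHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

# Client side: any stack, any language, as long as it can POST JSON.
req = Request(f"http://127.0.0.1:{server.server_port}/predict",
              data=json.dumps({"features": [1, 2, 3]}).encode(),
              headers={"Content-Type": "application/json"})
response = json.loads(urlopen(req).read())
server.shutdown()
print(response)  # {'score': 2.0}
```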
Page 42
Challenge #3: Large Models
Deep learning models are getting larger
• State of the art networks are easily multi-gigabyte
• Need to be loaded and scaled
Solutions:
• More hardware
• Smaller models
Page 43
Memory Per Model
Model                     Size (MB)   Error % (top-5)
SqueezeNet (compressed)   0.6         19.7%
SqueezeNet                4.8         19.7%
AlexNet                   240         19.7%
Inception v3              84          5.6%
VGG-19                    574         7.5%
ResNet-50                 102         7.8%
ResNet-200                519         4.8%
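The "Size (MB)" figures are roughly parameter count × 4 bytes (fp32 weights). A quick check against two widely cited parameter counts (~60M for AlexNet, ~144M for VGG-19; both approximate):

```python
def model_size_mb(num_params, bytes_per_param=4):
    """Approximate uncompressed model size in MB for fp32 weights."""
    return num_params * bytes_per_param / 1e6

alexnet_mb = model_size_mb(60_000_000)    # ~240 MB, matching the table
vgg19_mb = model_size_mb(143_700_000)     # ~575 MB, close to the table's 574
print(round(alexnet_mb), round(vgg19_mb))
```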
Page 44
SqueezeNet
Hypothesis: the networks we’re using today are much larger and
more complicated than they need to be.
Enter SqueezeNet: AlexNet-level accuracy with 50x fewer
parameters and < 0.5 MB model size.
Not quite state of the art, but close, and MUCH easier to host.
[1] Iandola, Han, Moskewicz, Ashraf, Dally, Keutzer
Page 45
Model Compression
Recent efforts have aimed at pruning the size of networks:
"Reduce the storage requirement of neural networks by 35x to 49x
without affecting their accuracy." (Han, Mao, Dally, 2015)
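Pruning, the first stage of such compression pipelines, can be sketched in a few lines. This is a toy magnitude-pruning example, not the full method from the paper (which also quantizes and Huffman-codes the surviving weights):

```python
def magnitude_prune(weights, threshold):
    """Zero out weights whose magnitude falls below the threshold.
    Storing only the surviving weights (sparsely) is what shrinks the model."""
    return [w if abs(w) >= threshold else 0.0 for w in weights]

weights = [0.8, -0.03, 0.5, 0.01, -0.9, 0.02, 0.4, -0.05]
pruned = magnitude_prune(weights, threshold=0.1)
sparsity = pruned.count(0.0) / len(pruned)
print(pruned, sparsity)  # half the weights drop out
```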
Page 46
Challenge #4: GPU Sharing
GPUs were not designed to be shared like CPUs
• Limited amount of video memory
• Even with multi-context management, memory overflows and
unrestricted pointer logic are very dangerous for other
applications
• Developers need a way to share GPU resources safely from
potentially malicious applications.
Page 47
Challenge #5: GPU Sharing - Containers
Docker is the new standard for deploying applications, but it adds an
additional layer of challenges to GPU computing.
• NVIDIA drivers must match inside and outside containers.
• CUDA drivers must match inside and outside containers.
• Some algorithms require X windows, which must be started
outside the container and mounted inside
• Nvidia-docker container is helpful but not a complete solution.
• New AWS Deep Learning AMI: a huge step in the right direction.
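For reference, the driver-mounting part is what the nvidia-docker wrapper automates. Its canonical smoke test (assuming the host NVIDIA driver and the nvidia-docker plugin are installed; this fragment requires GPU hardware):

```shell
# Launch a CUDA container with the host's GPU devices and driver
# volumes mounted in, then verify the GPUs are visible inside it.
nvidia-docker run --rm nvidia/cuda nvidia-smi
```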
Page 48
Lessons Learned
• Deep learning in the cloud is still in its infancy
• Hosting deep learning models is the next logical step after training
them, but the difficulty is underappreciated.
• Tooling and frameworks are making things easier, but there is a
lot of opportunity for improvement
Big picture: the challenges involved in creating DL models are only
half the problem. Deploying them is an entirely different skill set.
Page 50
Thank you!
Try out Algorithmia for free:
Code: reinvent16
Algorithmia.com
Page 51
Remember to complete
your evaluations!