Deep learning made doubly easy with reusable deep features
Carlos Guestrin, Dato CEO; Amazon Professor of Machine Learning, University of Washington
Successful apps in 2015 must be intelligent. Machine learning is key to next-gen apps:
• Recommenders
• Fraud detection
• Ad targeting
• Financial models
• Personalized medicine
• Churn prediction
• Smart UX (video & text)
• Personal assistants
• IoT
• Social nets
• …
Last decade: Data management
Last 5 years: Traditional analytics
Now: Intelligent apps
The ML pipeline circa 2013
Data → ML algorithm → "My curve is better than your curve" → write a paper
2015: Production ML pipeline
Data → data cleaning & feature engineering → ML algorithm → offline eval & parameter search → deploy model → your web service or intelligent app
Stages: data engineering → data intelligence → deployment
Using deep learning
Goal: a platform to help implement, manage, and optimize the entire pipeline
Today's talk
• Features in ML
• Neural networks
• Deep learning for computer vision
• Deep learning made easy with deep features
• Applications to text data
• Deployment in production
Features are key to machine learning
Simple example: Spam filtering
• A user walks into an email…
  - Will she think it's spam?
• What's the probability the email is spam?
Input x (text of email, user info, source info) → Model → Output: probability of y (Yes! / No)
Feature engineering: the painful black art of transforming raw inputs into useful inputs for an ML algorithm
• E.g., important words, stemming text, complex transformations of inputs, …
Input x (text of email, user info, source info) → Feature extraction → Features: Φ(x) → Model → Output: probability of y
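As a concrete illustration, a tiny hand-engineered Φ(x) for the spam example might look like the sketch below. The feature names and email fields are hypothetical, chosen only to show how raw inputs become classifier-ready features.

```python
def extract_features(email):
    # A toy phi(x): hand-engineered features from the raw inputs
    # (email text, user info, source info) described above.
    text = email["text"].lower()
    return {
        "contains_free": int("free" in text),            # important word
        "num_exclamations": email["text"].count("!"),    # text transformation
        "sender_known": int(email["sender"] in email["user_contacts"]),
    }

email = {
    "text": "FREE prizes! Click now!",
    "sender": "spammer@example.com",
    "user_contacts": {"friend@example.com"},
}
print(extract_features(email))
# {'contains_free': 1, 'num_exclamations': 2, 'sender_known': 0}
```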
Neural networks
Learning *very* non-linear features
Linear classifiers
• The most common classifiers: logistic regression, SVMs, …
• The decision boundary corresponds to a hyperplane, i.e., a line in high-dimensional space:
  w0 + w1 x1 + w2 x2 = 0
  Predict one class where w0 + w1 x1 + w2 x2 > 0, the other where w0 + w1 x1 + w2 x2 < 0
Graph representation of a classifier: useful for defining neural networks
Inputs x1, x2, …, xd (plus a constant input 1) connect to the output y with weights w0, w1, w2, …, wd:
  w0 + w1 x1 + w2 x2 + … + wd xd > 0 → output 1; otherwise → output 0
What can a linear classifier represent?
• x1 OR x2: weights w0 = -0.5, w1 = 1, w2 = 1
• x1 AND x2: weights w0 = -1.5, w1 = 1, w2 = 1
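These weight settings can be checked directly with a thresholded linear unit:

```python
def linear_unit(weights, inputs):
    # w0 + w1*x1 + w2*x2, thresholded at zero, as in the graphs above
    total = weights[0] + sum(w * x for w, x in zip(weights[1:], inputs))
    return 1 if total > 0 else 0

OR_weights = [-0.5, 1, 1]
AND_weights = [-1.5, 1, 1]

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2,
              linear_unit(OR_weights, (x1, x2)),
              linear_unit(AND_weights, (x1, x2)))
```

Running the loop reproduces the OR and AND truth tables.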
What can't a simple linear classifier represent?
XOR: the counterexample to everything. We need non-linear features.
Solving the XOR problem: adding a layer
XOR = (x1 AND NOT x2) OR (NOT x1 AND x2)
Hidden units, each thresholded to 0 or 1:
  z1: weights -0.5, 1, -1 on (1, x1, x2), i.e., x1 AND NOT x2
  z2: weights -0.5, -1, 1 on (1, x1, x2), i.e., NOT x1 AND x2
Output y: weights -0.5, 1, 1 on (1, z1, z2), i.e., z1 OR z2
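The two-layer XOR network above can be verified in a few lines:

```python
def threshold(total):
    return 1 if total > 0 else 0

def xor_net(x1, x2):
    # Hidden layer from the slide: z1 = x1 AND NOT x2, z2 = NOT x1 AND x2
    z1 = threshold(-0.5 + 1 * x1 - 1 * x2)
    z2 = threshold(-0.5 - 1 * x1 + 1 * x2)
    # Output layer: z1 OR z2
    return threshold(-0.5 + 1 * z1 + 1 * z2)

print([xor_net(a, b) for a in (0, 1) for b in (0, 1)])  # [0, 1, 1, 0]
```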
A neural network
• Layers and layers and layers of linear models and non-linear transformations
• Around for about 50 years; fell into "disfavor" in the 90s
• Big resurgence in the last few years
  - Impressive accuracy on several benchmark problems
  - Powered by huge datasets, GPUs, and modeling/learning-algorithm improvements
(Diagram: inputs x1, x2, plus a constant 1, feed hidden units z1, z2, which feed the output y)
Applications to computer vision (or the deep devil is in the deep details)
Image features
• Features = local detectors (eye, eye, nose, mouth → Face!)
  - Combined to make a prediction
  - (In reality, features are more low-level)
Many hand-created features exist…
Standard image classification approach
Input → extract features → use a simple classifier (e.g., logistic regression, SVMs) → Car?
… but hand-created features are very painful to design.
Use a neural network to learn the features: each layer learns features at a different level of abstraction.
Many tricks needed to work well…
• Different types of layers, connections, … are needed for high accuracy (Krizhevsky et al. '12)
Sample performance results
• Traffic sign recognition (GTSRB): 99.2% accuracy
• House number recognition (Google): 94.3% accuracy
Krizhevsky et al. ’12: 60M parameters, won 2012 ImageNet competition
ImageNet 2012 competition: 1.5M images, 1000 categories
Application to scene parsing
Challenges of deep learning
Deep learning workflow
Lots of labeled data → split into a training set (80%) and a validation set (20%) → learn a deep neural net model → validate → adjust hyper-parameters, model architecture, … → repeat
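The loop above (split, learn, validate, adjust) can be sketched as plain Python. Here `train_model` and `evaluate` are hypothetical stand-ins for a real deep learning framework's training and scoring calls; only the split and the search loop are concrete.

```python
import random

def train_val_split(data, val_fraction=0.2, seed=0):
    # The 80% / 20% split from the workflow
    shuffled = data[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * (1 - val_fraction))
    return shuffled[:cut], shuffled[cut:]

def tune(data, candidate_lrs, train_model, evaluate):
    # Learn on the training set, score on the validation set, and keep
    # the best hyper-parameter setting (here: just a learning rate).
    train, val = train_val_split(data)
    best = None
    for lr in candidate_lrs:
        model = train_model(train, lr)
        score = evaluate(model, val)
        if best is None or score > best[0]:
            best = (score, lr, model)
    return best
```

In practice the "adjust" step also varies the architecture itself, which is part of what makes deep learning hard to tune.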
Deep learning score card
Pros:
• Enables learning of features rather than hand tuning
• Impressive performance gains on computer vision, speech recognition, and some text analysis
• Potential for much more impact
Cons:
• Computationally really expensive
• Requires a lot of data for high accuracy
• Extremely hard to tune: choice of architecture, parameter types, hyperparameters, learning algorithm, …
• Computational cost + so many choices = incredibly hard to tune
Deep features: deep learning + transfer learning
Change the image classification approach?
Input → extract features → use a simple classifier (e.g., logistic regression, SVMs) → Car?
Can we learn features from data, even when we don't have enough data or time?
Transfer learning: use data from one domain to help learn on another
• Lots of data: learn a neural net → great accuracy on cat vs. dog
• Some data: use that neural net as a feature extractor + a simple classifier → great accuracy on 101 categories
An old idea, explored for deep learning by Donahue et al. '14
What's learned in a neural net
For a neural net trained for Task 1 (cat vs. dog):
• The later layers are very specific to Task 1 and should be ignored for other tasks
• The earlier layers are more generic and can be used as a feature extractor
Transfer learning in more detail…
Take the neural net trained for Task 1 (cat vs. dog):
• The layers specific to Task 1 should be ignored for other tasks
• The more generic layers can be used as a feature extractor: keep their weights fixed!
• For Task 2, predicting 101 categories, learn only the end part: a simple classifier, e.g., logistic regression, SVMs → Class?
Careful where you cut: the last few layers tend to be too specific (e.g., too specific for car detection), so use the earlier layers as features!
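The cut-and-reuse recipe can be sketched end to end. Everything below is a toy stand-in: the "pretrained" extractor is a fixed random projection with a ReLU (a real one would be a trained convnet with its final layers removed), the dataset is synthetic, and a nearest-centroid rule stands in for the simple classifier (the talk uses logistic regression or SVMs).

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a pretrained net cut before its last layers
W_frozen = rng.normal(size=(100, 32))

def extract_deep_features(x):
    # Weights stay fixed: no training happens in this function
    return np.maximum(x @ W_frozen, 0)

# Tiny synthetic "Task 2" dataset: two categories of raw inputs
X = rng.normal(size=(40, 100))
y = np.array([0] * 20 + [1] * 20)
X[y == 1] += 0.5  # give class 1 a detectable shift

# "Simple classifier" learned on top of the frozen features
feats = extract_deep_features(X)
centroids = np.stack([feats[y == c].mean(axis=0) for c in (0, 1)])

def predict(x_new):
    f = extract_deep_features(x_new)
    dists = np.linalg.norm(f[:, None, :] - centroids[None, :, :], axis=2)
    return dists.argmin(axis=1)

print((predict(X) == y).mean())  # training accuracy
```

Only the centroids are learned for Task 2; the expensive feature extractor is reused as-is.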
Transfer learning with deep features
Some labeled data → extract features with a neural net trained on a different task → split into a training set (80%) and a validation set (20%) → learn a simple model → validate → deploy in production
How general are deep features?
Applications to text data
Simple text classification with bag of words
One "feature" per word: its count in the document, e.g., aardvark 0, about 2, all 2, Africa 1, apple 0, anxious 0, …, gas 1, …, oil 1, …, Zaire 0
Use a simple classifier, e.g., logistic regression, SVMs → Class?
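A bag-of-words featurizer like the one above fits in a few lines; the vocabulary and document here are illustrative, not the talk's actual data.

```python
import re
from collections import Counter

def bag_of_words(text, vocabulary):
    # One count per vocabulary word; words outside the vocabulary are ignored
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    return [counts[w] for w in vocabulary]

vocab = ["aardvark", "about", "all", "africa", "apple", "gas", "oil", "zaire"]
doc = "About 100 barrels of oil from Africa. All about gas and oil? All gas."
print(bag_of_words(doc, vocab))  # [0, 2, 2, 1, 0, 2, 2, 0]
```

The resulting vector is as long as the vocabulary, which is what makes bag of words simple but high-dimensional and sparse.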
Word2Vec: a neural network for finding a high-dimensional representation per word (Mikolov et al. '13)
Skip-gram model: from a word, predict nearby words in the sentence
E.g., in "Awesome deep learning talk at Strata", the neural net maps each word to a 300-dim representation; these representations can be viewed as deep features.
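The skip-gram training data itself is easy to generate: each word is paired with the nearby words inside a context window. A minimal sketch (the window size and sentence are illustrative):

```python
def skipgram_pairs(sentence, window=2):
    # (center word, context word) training pairs for a skip-gram model
    words = sentence.lower().split()
    pairs = []
    for i, center in enumerate(words):
        for j in range(max(0, i - window), min(len(words), i + window + 1)):
            if j != i:
                pairs.append((center, words[j]))
    return pairs

print(skipgram_pairs("Awesome deep learning talk at Strata", window=1))
```

The neural net is then trained to predict the context word from the center word, and the learned hidden-layer weights become the word representations.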
Related words are placed nearby in the high-dimensional space (figure: projecting the 300-dim space into 2 dims with PCA; Mikolov et al. '13)
Text classification with word embeddings
Embed each word into the 300-dim space, then use a classifier, e.g., logistic regression, SVMs, with 300 × number_of_words parameters → Class?
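One simple way to turn per-word embeddings into a document feature vector (an assumption for illustration, not necessarily the talk's exact setup) is to average the vectors of the words present. The toy embeddings below are 4-dimensional stand-ins for real 300-dim word2vec vectors.

```python
import numpy as np

# Toy embeddings (hypothetical; real word2vec vectors are ~300-dim)
embeddings = {
    "great": np.array([0.9, 0.1, 0.0, 0.2]),
    "talk":  np.array([0.1, 0.8, 0.1, 0.0]),
    "oil":   np.array([0.0, 0.1, 0.9, 0.3]),
}

def embed_document(text, dim=4):
    # Average the embeddings of known words: one fixed-length
    # deep-feature vector per document, fed to a simple classifier
    vecs = [embeddings[w] for w in text.lower().split() if w in embeddings]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

print(embed_document("Great talk"))  # [0.5  0.45 0.05 0.1 ]
```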
Practical example
Blog corpus: the closest words to "LOL" in the 300-dim space are Haha, Yea, Hahaha, Hahah, Lisxc, Umm, Hehe, laughingoutloud
Predicts the gender of the author with 79% accuracy
Deploying ML in production
Data → ML algorithm → deployment?
• The data scientist writes a spec; another team (data engineers, data architects, DevOps, app developers) implements the custom model in a 'production' language behind the app API
  - 6-12 months
  - Stale/irrelevant model/approach
  - 2 teams maintaining 2 systems
ML deployment requirements
• Easy to integrate: REST API
• Scalable
• Fault tolerant
• Flexible: any model, any Python
(Architecture diagram: app → load balancer → replicated API + cache nodes → GLC models, Dato models, arbitrary Python)
Do-It-Yourself
• Web service layer: Tornado, Flask, Keen, Django, …
• Caching layer: Redis, Cassandra, Memcached, DynamoDB, MySQL, …
• Logs: Logback, LogStash, Splunk, Loggly, …
• Metrics: AWS CloudWatch, Mixpanel, Librato, …
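The web-service layer can be as small as a single handler. Here is a minimal stdlib-only sketch of a prediction endpoint; the model function and feature names are hypothetical, and a production service would add the caching, logging, and metrics layers listed above.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# Toy "model": any callable from a feature dict to a score
# (a real service would load a trained model instead)
def model(features):
    return 0.9 if features.get("word_count", 0) > 100 else 0.1

class PredictHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read the JSON feature payload and return a JSON prediction
        body = self.rfile.read(int(self.headers["Content-Length"]))
        features = json.loads(body)
        payload = json.dumps({"spam_probability": model(features)}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)

    def log_message(self, *args):
        pass  # silence per-request logging for the demo

# To serve: HTTPServer(("0.0.0.0", 8080), PredictHandler).serve_forever()
```

Even this toy shows why DIY gets painful: fault tolerance, scaling behind a load balancer, and model updates are all left to you.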
… or use Dato Predictive Services
• Serves predictions to your web service or intelligent app in a robust, scalable, incremental fashion (caching layer + predictive object server)
• Serve any model: GraphLab Create, scikit-learn, Python, …
• Swap in a better ML model as one becomes available
• Out-of-core computation, tools for feature engineering, rich data type support
• Models built for scale, app-oriented toolkits, advanced ML & extensible
• Deploy models as low-latency REST services
• Same code for distributed computation
• Elastically scale up or out with one command
• Job monitoring & model management
• Deploy existing Python code & models
• Run on AWS EC2 or Hadoop YARN
(Diagram: the Dato Platform spans clean → learn → deploy: GraphLab Create (SFrame, SGraph, Canvas, Create Engine, machine learning toolkits, SDK), Dato Distributed (Distributed Engine, Job Client, Job Mgmt), and Dato Predictive Services (Predictive Engine, REST Client, Model Mgmt))
Summary
Deep learning made easy with deep features
• Deep learning: an exciting ML development
• But slow, with lots of tuning, and it needs lots of data
• Deep features: reuse deep models for new domains
  - Needs less data
  - Faster training times
  - Much simpler tuning
  - Can still achieve excellent performance