Deep Learning Overview
Joseph E. Gonzalez [email protected]
Borrowed heavily from excellent talks by:
• Adam Coates: http://ai.stanford.edu/~acoates/coates_dltutorial_2013.pptx
• Fei-Fei Li and Andrej Karpathy: http://cs231n.stanford.edu/syllabus.html
Machine Learning → Function Approximation
• Object Recognition: image → Label: Cat
• Speech Recognition: audio → "The cat in the hat"
• Machine Translation: "The cat in the hat." → "El gato en el sombrero"
• Robotic Control
Function Approximation Pipeline
[Figure: raw input → hand-engineered features → trained model → prediction; e.g., an image through an SVM → Label: Cat, or audio through an HMM → "The cat in the hat"]
Function Approximation Pipeline
Often build multiple layers of features to abstract the input
[Figure: input → layer 1 features → layer 2 features → SVM → Label: Cat]
Deep learning tries to automate this process.
Function Approximation Pipeline
Deep Learning: automatically learn a deep hierarchy of abstract features along with the classifier.
• Typically using neural networks
• Composable general function approximators
[Figure: input → low-level features → mid-level features → high-level features → SVM → Label: Cat; each layer learns a more abstract representation]
Why is Deep Learning so Successful?
• Feature engineering is essential to many applications
  • Expensive hand-engineering of "layers" of representation
  • Deep learning automates the process of feature engineering
• Previous attempts were limited by data and computation
  • We now have access to substantial amounts of both
• Deep learning techniques are inherently compositional
  • Easy to extend and combine → rapid development
Crash Course in Neural Networks
Supervised Learning
• Predict: is this a picture of a cat?
• Given an input x ∈ ℝ^d (the image pixels) and training data:
  D_train = {(x_i, y_i)}_{i=1}^n,  y_i ∈ {0, 1}
• Learn a function f_w(x) = y
• By estimating the parameters w
Logistic Regression for Binary Classification
• Consider the simple function family:
  f_w(x) = σ(wᵀx) = σ(Σ_{j=1}^d w_j x_j) = P(y = 1 | x)
• With non-linearity:
  σ(u) = 1 / (1 + exp(−u))
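As a concrete sketch of this function family (NumPy; the weights and input below are made up for illustration):

```python
import numpy as np

def sigmoid(u):
    # Logistic non-linearity: sigma(u) = 1 / (1 + exp(-u))
    return 1.0 / (1.0 + np.exp(-u))

def f(w, x):
    # f_w(x) = sigma(w^T x) = P(y = 1 | x)
    return sigmoid(w @ x)

# Hypothetical weight vector and a single input vector (d = 3)
w = np.array([1.0, -2.0, 0.5])
x = np.array([0.2, 0.1, 1.0])
p = f(w, x)  # a value in (0, 1): the predicted probability that y = 1
```

Whatever the input, the output lands in (0, 1), which is what lets it be read as a probability.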
Logistic Regression as a "Neuron"
• Consider the simple function family:
  f_w(x) = σ(wᵀx) = σ(Σ_{j=1}^d w_j x_j) = P(y = 1 | x),  σ(u) = 1 / (1 + exp(−u))
[Figure: "perceptron" diagram — inputs x_j scaled by weights w_j (e.g., w_52 x_52), summed with a bias term w_0, and passed through σ]
• The neuron "fires" if the weighted sum of its inputs is greater than zero.
Learning the Logistic Regression Model
• Consider the simple function family:
  f_w(x) = σ(wᵀx) = σ(Σ_{j=1}^d w_j x_j) = P(y = 1 | x),  σ(u) = 1 / (1 + exp(−u))
• Goal: find w that minimizes the loss on the training data:
  L(w) = Σ_{i=1}^n L(f_w(x_i), y_i) = −Σ_{i=1}^n log( f_w(x_i)^{y_i} (1 − f_w(x_i))^{1−y_i} )
  (the negative log-likelihood of the training data)
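A minimal NumPy sketch of this objective (X is an assumed n × d design matrix and y a 0/1 label vector; the product form is the likelihood itself, the sum form its negative log):

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def likelihood(w, X, y):
    # prod_i f_w(x_i)^{y_i} (1 - f_w(x_i))^{1 - y_i}
    p = sigmoid(X @ w)
    return np.prod(p**y * (1.0 - p)**(1.0 - y))

def neg_log_likelihood(w, X, y):
    # Equivalent loss to minimize:
    # -sum_i [ y_i log p_i + (1 - y_i) log(1 - p_i) ]
    p = sigmoid(X @ w)
    return -np.sum(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))
```

Minimizing the negative log-likelihood is the same as maximizing the likelihood, since exp(−neg_log_likelihood) equals the likelihood.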
Numerical Optimization
• Gradient Descent:
  w^(t+1) = w^(t) − η_t ∇_w L(w)
• Convex → guaranteed to find the optimal w
• Computing the full gradient is slow:
  ∇_w L(w) = Σ_{i=1}^n ∇_w L(f_w(x_i), y_i)
• Stochastic gradient descent: approximate the gradient from a single sampled example:
  ∇_w L(w) ≈ n ∇_w L(f_w(x_i), y_i)  for (x_i, y_i) ~ D
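The stochastic update can be sketched as follows (NumPy; `eta` is a fixed step size rather than the schedule η_t, and the per-example gradient of the negative log-likelihood works out to (f_w(x_i) − y_i)·x_i — the toy data below is made up):

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def sgd_logistic(X, y, eta=0.1, epochs=100, seed=0):
    # Stochastic gradient descent for logistic regression:
    # repeatedly visit one example at a time and step along its gradient.
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        for i in rng.permutation(n):  # one pass over shuffled data
            grad = (sigmoid(X[i] @ w) - y[i]) * X[i]
            w -= eta * grad
    return w

# Toy 1-D data: positive inputs are class 1, negative inputs are class 0
X = np.array([[1.0], [2.0], [-1.0], [-2.0]])
y = np.array([1.0, 1.0, 0.0, 0.0])
w = sgd_logistic(X, y)
```

Each update costs O(d) rather than O(nd), which is the point of the approximation: many cheap noisy steps instead of few exact ones.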
Logistic Regression: Strengths and Limitations
• Widely used machine learning technique
  • Convex → efficient to learn
  • Easy to interpret model weights
  • Works well given good features
• Limitations:
  • Restricted to linear relationships → sensitive to the choice of features
[Figure: scatter plot of classes y = 0 and y = 1 in raw pixel space (Pixel 1 vs. Pixel 2); the two classes are not linearly separable]
Feature Engineering
• Rather than use raw pixels, build/train feature functions: g = (Paws, Furry)
[Figure: in raw pixel space (Pixel 1 vs. Pixel 2) a linear model f_w(x) cannot separate the classes; in feature space (Paws vs. Furry) f_w(g(x)) separates them]
Composing Linear Models and Non-linearities
• First layer: map the input layer (pixels) x ∈ ℝ^d to k_1 hidden units:
  z^1 = W^0 x,  W^0 ∈ ℝ^{k_1 × d}
  h^1 = σ(z^1)  (σ applied element-wise, giving h_1, h_2, …, h_{k_1})
Composing Linear Models and Non-linearities
• Second layer: map h^1 ∈ ℝ^{k_1} to k_2 hidden units:
  z^2 = W^1 h^1,  W^1 ∈ ℝ^{k_2 × k_1}
  h^2 = σ(z^2)
Neural Networks
• Composing "perceptrons":
  y = f_{W^0, W^1, W^2}(x) = σ(W^2 σ(W^1 σ(W^0 x)))
• Forward propagation:
  x → σ(W^0 x) → h^1 → σ(W^1 h^1) → h^2 → σ(W^2 h^2) → f
[Figure: input layer x_1 … x_d, hidden layers h^1_1 … h^1_{k_1} and h^2_1 … h^2_{k_2}, and output f, connected by weight matrices W^0, W^1, W^2]
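A minimal forward pass for this three-weight-matrix network (NumPy; the layer sizes and random weights are made up, and σ is applied element-wise at every layer as on the slide):

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def forward(x, W0, W1, W2):
    # y = sigma(W2 sigma(W1 sigma(W0 x)))
    h1 = sigmoid(W0 @ x)     # first hidden layer,  shape (k1,)
    h2 = sigmoid(W1 @ h1)    # second hidden layer, shape (k2,)
    return sigmoid(W2 @ h2)  # output f

# Hypothetical sizes: d inputs, k1 and k2 hidden units, 1 output
rng = np.random.default_rng(0)
d, k1, k2 = 4, 3, 2
W0 = rng.normal(size=(k1, d))
W1 = rng.normal(size=(k2, k1))
W2 = rng.normal(size=(1, k2))
y = forward(rng.normal(size=d), W0, W1, W2)
```

Each layer is just the logistic-regression building block from earlier, so the whole network is a composition of the same linear-plus-σ unit.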
[Figure: the full GoogLeNet architecture — a deep stack of Conv layers (1x1, 3x3, 5x5, 7x7), MaxPool, AveragePool, LocalRespNorm, and FC layers, with repeated DepthConcat (Inception) modules and three softmax outputs: softmax0, softmax1, softmax2]
Figure 3: GoogLeNet network with all the bells and whistles
Table 2 Inference performance, power, and energy efficiency on Titan X and Xeon E5-2698 v3.
The comparison between Titan X and Xeon E5 reinforces the same conclusion as the comparison between Tegra X1 and Core i7: GPUs appear to be capable of significantly higher energy efficiency for deep learning inference on AlexNet. In the case of Titan X, the GPU not only provides much better energy efficiency than the CPU, but it also achieves substantially higher performance at over 3000 images/second in the large-batch case compared to less than 500 images/second on the CPU. While larger batch sizes are more efficient to process, the comparison between Titan X and Xeon E5 with no batching proves that the GPU’s efficiency advantage is present even for smaller batch sizes. In comparison with Tegra X1, the Titan X manages to achieve competitive Performance/Watt despite its much bigger GPU, as the large 12 GB framebuffer allows it to run more efficient but memory-capacity-intensive FFT-based convolutions.
Finally, Table 3 presents inference results on GoogLeNet. As mentioned before, IDLF provides no support for GoogLeNet, and alternative deep learning frameworks have never been optimized for CPU performance. Therefore, we omit CPU results here and focus entirely on the GPUs. As GoogLeNet is a much more demanding network than AlexNet, Tegra X1 cannot run batch size 128 inference due to insufficient total memory capacity (4GB on a Jetson™ TX1 board). The massive framebuffer on the Titan X is sufficient to allow inference with batch size 128.
Table 3 GoogLeNet inference results on Tegra X1 and Titan X. Tegra X1's total memory capacity is not sufficient to run batch size 128 inference.
Compared to AlexNet, the results show significantly lower absolute performance values, indicating how much more computationally demanding GoogLeNet is. However, even on GoogLeNet, all GPUs are capable of achieving real-time performance on a 30 fps camera feed.