A brief introduction to Computer Vision
Speaker: Michele Ermidoro
Place: UniBG, Dalmine
Date: 12 March 2019
Course: Ingegneria dei Sistemi di Controllo, AA 2018-2019
Outline
1. What is Machine Vision?
2. Digital image: start from basics
• Image representation
• Image processing
3. Classic approach
4. Convolutional Neural Network
5. An Object Detection Pipeline
6. Deep Learning Framework
What is Machine Vision?
Computer vision (CV), from the perspective of engineering, seeks to automate tasks that the human visual system can do. Computer vision tasks include methods for acquiring, processing, analyzing and understanding digital images, and extraction of high-dimensional data from the real world in order to produce numerical or symbolic information, e.g., in the form of decisions. Understanding in this context means the transformation of visual images (the input of the retina) into descriptions of the world that can interface with other thought processes and elicit appropriate action.
[https://en.wikipedia.org/wiki/Computer_vision]
Machine vision (MV) is the technology and methods used to provide imaging-based automatic inspection and analysis for such applications as automatic inspection, process control, and robot guidance, usually in industry. Machine vision as a systems engineering discipline can be considered distinct from computer vision, a form of computer science. It attempts to integrate existing technologies in new ways and apply them to solve real-world problems.
[https://en.wikipedia.org/wiki/Machine_vision]
What is Machine Vision?
[Figure: computer vision and machine vision shown in relation to neighbouring fields and topics: image processing, neurobiology, imaging, optics, signal processing, robotics, artificial intelligence (AI), machine learning, math, computer intelligence, object detection, cognitive vision, geometry, statistics, optimization, human vision, lighting, imaging sensors, lenses, odometry, navigation, data, estimation, filtering]
What is Machine Vision?
• Almost 80% of the data traveling on the net is visual data
• Everybody has a smartphone, and every smartphone has at least 2 cameras
• "A picture is worth a thousand words" is an English-language idiom
• A camera is one of the most powerful sensors in a lot of applications
• It has wide fields of application:
  • Robotics
  • Surveillance
  • Industry
  • Self-driving cars/drones/buses
  • Medical
  • …
http://crcv.ucf.edu/people/faculty/Bagci/research.php
Computer Vision – Why?
Computer Vision – Hype?
Human: ≈5.1
What happened here?
1. Computational Power
2. Convolutional Neural Network (CNN)
Computational Power
https://www.karlrupp.net/2013/06/cpu-gpu-and-mic-hardware-characteristics-over-time/
2009 - "We argue that modern graphics processors far surpass the computational capabilities of multicore CPUs, and have the potential to revolutionize the applicability of deep unsupervised learning methods" ("Large-scale Deep Unsupervised Learning using Graphics Processors", Rajat Raina, Anand Madhavan, Andrew Y. Ng)
https://blog.inten.to/cpu-hardware-for-deep-learning-b91f53cb18af
IMAGENET Challenge
Convolutional Neural Network
Computer Vision Tasks
http://vision.stanford.edu
Classification
What’s in the image?
• People
• Car
• Traffic light
• Clock
• …
Computer Vision Tasks
http://vision.stanford.edu
Detection
What's in the image, and where is it?
• A car in the orange box
• A person in the blue box
• A clock in the green box
Computer Vision Tasks
http://vision.stanford.edu
Segmentation
What's in the image, where is it, and which pixels belong to it?
[Figure: the pixels of the car and of the clock are highlighted]
Computer Vision Tasks
http://vision.stanford.edu
Annotation
How would you describe the picture?
"People crossing a street while a car is waiting"
Computer Vision Tasks
https://engineering.matterport.com/splash-of-color-instance-segmentation-with-mask-r-cnn-and-tensorflow-7c761e238b46
• Face detection
• Smile detection
• Eye-open detection
• …
Computer Vision Tasks - Others
https://medium.com/waymo/recreating-the-self-driving-experience-the-making-of-the-waymo-360-video-37a80466af49
Computer Vision Tasks - Others
Figure credit: Dai, He, and Sun, “Instance-aware Semantic Segmentation via Multi-task Network Cascades”, CVPR 2016
Computer Vision Tasks - Others
Cardiologist-Level Arrhythmia Detection with Convolutional Neural Networks
Computer Vision Tasks - Others
https://www.youtube.com/watch?v=bcswZLwhTUI
Computer Vision Tasks - Others
https://www.youtube.com/watch?v=NrmMk1Myrxc
What's next
• Digital image basics
• Classic approach: feature engineering
• CNN
Digital image: start from basics – Image representation
Image representation
An image, inside a PC, is just a matrix of numbers
255 -> white
0 -> black
Pixel values (16 values, as listed on the slide): 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1
• 4x4x1 matrix
• Color depth: 1 bit
• Color channels: 1
• 16 pixels
Image representation
An image, inside a PC, is just a matrix of numbers
Pixel values (16 values, as listed on the slide): 217, 255, 255, 255, 255, 191, 255, 255, 255, 255, 255, 127, 255, 255, 0, 255
• 4x4x1 matrix
• Color depth: 8 bit
• Color channels: 1
• 16 pixels
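Image representation
Channel 1 pixel values (16 values, as listed on the slide): 255, 255, 255, 255, 255, 170, 255, 255, 255, 255, 255, 85, 255, 255, 0, 255
Channel 2 pixel values (16 values, as listed on the slide): 0, 255, 255, 255, 255, 85, 255, 255, 255, 255, 255, 170, 255, 255, 255, 255
• 4x4x2 matrix
• Color depth: 8 bit
• Color channels: 2
• 16 pixels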
Image representation
• 2453x2453x3 matrix
• Color depth: 24 bit
• Color channels: 3
• Color space: sRGB
• 6 Mpixel (6,017,209 pixels)
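As a minimal sketch of the "matrix of numbers" idea, the snippet below stores an image as a NumPy array and reads off its pixel count, channels and color depth. The use of NumPy, the helper name `describe`, and the 4x4 arrangement of the gray values are assumptions made for illustration.

```python
import numpy as np

# 4x4 single-channel, 8-bit image. The values come from the gray-scale slide;
# their arrangement in rows and columns is an assumption of this sketch.
gray = np.array([
    [217, 255, 255, 255],
    [255, 191, 255, 255],
    [255, 255, 127, 255],
    [255, 255,   0, 255],
], dtype=np.uint8)

# Hypothetical 4x4 RGB image: three 8-bit channels, i.e. 24-bit color depth.
rgb = np.zeros((4, 4, 3), dtype=np.uint8)
rgb[..., 0] = 255  # every pixel is pure red


def describe(img):
    """Return pixel count, channel count and bits per pixel of an image array."""
    height, width = img.shape[:2]
    channels = 1 if img.ndim == 2 else img.shape[2]
    bits_per_pixel = img.dtype.itemsize * 8 * channels
    return {"pixels": height * width, "channels": channels, "bits_per_pixel": bits_per_pixel}


print(describe(gray))
print(describe(rgb))
```

The same reasoning gives the pixel count of the large example: 2453 x 2453 = 6,017,209 pixels.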
Color spaces
A color space, also known as a color model (or color system), is an abstract mathematical model which describes the range of colors as tuples of numbers, typically as 3 or 4 values or color components.
There are a variety of color spaces, such as RGB, CMYK, HSV, CIE XYZ.
• CMYK (Cyan, Magenta, Yellow, Key black)
• RGB (Red, Green, Blue)
Color spaces
CIE XYZ: It is the most accurate from a scientific point of view. It tries to represent all the colors that a human eye can see.
RGB: The RGB color model is an additive color model in which red, green and blue light are added. The name comes from the three additive primary colors, red, green, and blue.
HSB (Hue/Sat/Bright): Designed in the 1970s by computer graphics researchers to more closely align with the way human vision perceives color-making attributes.
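To make the relation between two of these color spaces concrete, here is a small sketch that converts 8-bit RGB values to HSB (also called HSV) with Python's standard `colorsys` module. The helper name `rgb8_to_hsb` and the choice of degrees and percentages as output units are assumptions for illustration.

```python
import colorsys


def rgb8_to_hsb(r, g, b):
    """Convert 8-bit RGB to (hue in degrees, saturation %, brightness %)."""
    h, s, v = colorsys.rgb_to_hsv(r / 255.0, g / 255.0, b / 255.0)
    return round(h * 360), round(s * 100), round(v * 100)


print(rgb8_to_hsb(255, 0, 0))      # pure red
print(rgb8_to_hsb(0, 0, 255))      # pure blue
print(rgb8_to_hsb(255, 255, 255))  # white: no saturation, full brightness
```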
Histogram
A color histogram is a representation of the distribution of colors in an image
Gray-scale histogram
Color histogram
Histogram equalization
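The following sketch shows one common way to compute a gray-scale histogram and to equalize it through the cumulative distribution. It assumes an 8-bit single-channel image stored as a NumPy array; the function names are hypothetical, and this is not necessarily the exact method used to produce the slide figures.

```python
import numpy as np


def gray_histogram(img):
    """Count how many pixels take each of the 256 possible gray levels."""
    return np.bincount(img.ravel(), minlength=256)


def equalize(img):
    """Spread the gray levels over 0..255 using the cumulative histogram."""
    cdf = gray_histogram(img).cumsum()
    cdf_min = cdf[cdf > 0][0]          # cumulative count at the darkest level present
    total = img.size
    if total == cdf_min:               # constant image: nothing to equalize
        return img.copy()
    lut = np.round((cdf - cdf_min) / (total - cdf_min) * 255)
    lut = np.clip(lut, 0, 255).astype(np.uint8)
    return lut[img]                    # apply the look-up table to every pixel


low_contrast = np.array([[50, 50], [100, 150]], dtype=np.uint8)
print(equalize(low_contrast))
```

After equalization the darkest level present maps to 0 and the brightest to 255, so the full range is used.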
Digital image: start from basics – Image processing
Binarization
Image binarization transforms a gray-scale image into a binary image, depending on a threshold:
$$g[n,m] = \begin{cases} 255, & f[n,m] > 100 \\ 0, & \text{otherwise} \end{cases}$$
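A minimal sketch of the thresholding rule above, assuming the gray-scale image f is a NumPy array. The function name `binarize` is hypothetical; the default threshold of 100 is the value used on the slide.

```python
import numpy as np


def binarize(f, threshold=100):
    """Return 255 where f > threshold and 0 elsewhere."""
    return np.where(f > threshold, 255, 0).astype(np.uint8)


f = np.array([[0, 100], [101, 217]], dtype=np.uint8)
print(binarize(f))
```

Note that a pixel exactly equal to the threshold maps to 0, because the comparison is strict.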
Filters
Most operations on an image are performed using filters
• Filters are mathematical functions which are applied to each pixel.
• The filter is represented using a matrix, called the kernel
• Depending on the size of the kernel (3x3, 5x5, 7x7, ...), the function involves the pixel and its neighbors.
• Filters are applied by convolving the image with the kernel
• To preserve the overall brightness, the kernel should sum to 1 (edge-detection kernels sum to 0 instead)
Convolution
Convolution is the process of adding each element of the image to its local neighbors, weighted by the kernel.
[Figure: step-by-step convolution of an input image with a kernel (steps 1, 5 and 9 shown). Output after step 9:]
-13 -20 -17
-18 -24 -18
 13  20  17
http://www.songho.ca/dsp/convolution/convolution2d_example.html
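The sketch below implements a plain 2D convolution with zero padding. The input image and kernel are not readable from the slide text, so the values used here are assumed from the linked example; with them the code reproduces the output values listed on the slide.

```python
import numpy as np


def convolve2d(image, kernel):
    """2D convolution with zero padding; the output has the same size as the image."""
    kh, kw = kernel.shape
    ph, pw = kh // 2, kw // 2
    padded = np.pad(image, ((ph, ph), (pw, pw)))   # zeros around the border
    flipped = kernel[::-1, ::-1]                   # convolution flips the kernel
    out = np.zeros(image.shape, dtype=float)
    for i in range(image.shape[0]):
        for j in range(image.shape[1]):
            out[i, j] = np.sum(padded[i:i + kh, j:j + kw] * flipped)
    return out


# Assumed input and kernel (taken from the linked example, not from the slide text).
image = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]], dtype=float)
kernel = np.array([[-1, -2, -1], [0, 0, 0], [1, 2, 1]], dtype=float)
print(convolve2d(image, kernel))
```

For symmetric kernels the flip makes no difference, which is why many image libraries skip it.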
Kernel shape
Depending on the shape of the kernel, different operations can be made:

Edge detection:
-1 -1 -1
-1  8 -1
-1 -1 -1

Sharpen:
 0 -1  0
-1  5 -1
 0 -1  0

Blur / moving average (scaled by 1/9):
1 1 1
1 1 1
1 1 1

Original (identity):
0 0 0
0 1 0
0 0 0
Kernel shape
Depending on the shape of the kernel, different operations can be made:

Vertical lines:
-1  2 -1
-1  2 -1
-1  2 -1

Horizontal lines:
-1 -1 -1
 2  2  2
-1 -1 -1

45° lines:
-1 -1  2
-1  2 -1
 2 -1 -1

Edge:
-1 -1 -1
-1  8 -1
-1 -1 -1
Kernel example
Denoising: reduce noise in an image

Gaussian filter (scaled by 1/16):
1 2 1
2 4 2
1 2 1
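As an illustrative sketch of denoising with the Gaussian kernel, the code below smooths a flat image containing one noisy pixel. The test image, the helper name `filter_valid` and the choice to compute only the positions where the kernel fits entirely inside the image are assumptions made for this example.

```python
import numpy as np

# 3x3 Gaussian kernel with its 1/16 scale factor, and an edge-detection kernel.
gaussian = np.array([[1, 2, 1], [2, 4, 2], [1, 2, 1]], dtype=float) / 16.0
edge = np.array([[-1, -1, -1], [-1, 8, -1], [-1, -1, -1]], dtype=float)


def filter_valid(image, kernel):
    """Apply a kernel only where it fits inside the image (no padding).

    Both kernels above are symmetric, so flipping them (as a strict
    convolution would) does not change the result.
    """
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out


# Hypothetical flat image (value 100) with one noisy pixel (value 200).
noisy = np.full((5, 5), 100.0)
noisy[2, 2] = 200.0
smoothed = filter_valid(noisy, gaussian)
print(smoothed)
print(gaussian.sum(), edge.sum())
```

The Gaussian kernel sums to 1, so flat regions keep their value while the spike is attenuated. The edge kernel sums to 0, so flat regions map to 0.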
Classic approach
Visual recognition
Analyze images and extract high-level information, such as what is in the picture and where it is.
Challenges:
• Different viewpoint
• Different illumination
• Deformations
• Different shape of the same object
How to identify an object?
Object Bag of ‘Features’
How to identify an object?
The idea is to extract from the image some of its most important characteristics. We'll call them "features".
These features are then compared with a dictionary, where the specific knowledge is stored.
The selection of what is important to search for in the image is called feature engineering and extraction.
The comparison phase is done through a classifier.
The knowledge is passed to the classifier during training.
Classification pipeline
The classifier is a function that receives as input all the representative features of an image and decides what is in the picture.
[Figure: feature extraction followed by the classifier 𝑓, with example outputs 𝑓 = apple, 𝑓 = cow, 𝑓 = tomato]
Classification pipeline
• Training: train images → feature extraction → features; features + training labels → training → classifier
• Prediction: test image → feature extraction → features → classifier → "It's an apple"
The two design choices are feature engineering and the choice of the classifier.
Features?
In Computer Vision, a feature is a piece of information which is particularly relevant in the solution of a certain problem (e.g. in the detection of a face, the presence of two eyes is a good feature)
We can identify 3 hierarchical categories of features:
• Low-level features
  • Colors
  • Edges
  • Blob
  • Corners
• Mid-level features
  • Scale-invariant features
  • SIFT
  • SVD
• High-level features
  • Histograms of gradients (HOG)
  • Region descriptors
The selection process is called feature engineering
Low Level Features
Edges & Corners: Since image classification relies on edges and corners, there are many methods for finding them.
Canny edge detector: It uses a 5-step algorithm, involving some (gradient) filters, to compute all the edges in an image.
Harris corner detection: A sliding window moves over the image and, at each position, computes the structure tensor (the second-moment matrix of the image gradients). Corners are detected by evaluating the eigenvalues of each matrix.
https://dsp.stackexchange.com/questions/14338/corner-detection-using-chris-harris-mike-stephens
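A rough sketch of the corner-detection idea: build the 2x2 matrix of summed gradient products in a 3x3 window, then score it with the Harris response det(M) − k·trace(M)², which is a standard way to judge the eigenvalues without computing them. The window size, the value k = 0.05, the synthetic test image and the function names are assumptions for illustration.

```python
import numpy as np


def box_sum3(a):
    """Sum of each 3x3 neighbourhood (zero padding at the border)."""
    padded = np.pad(a, 1)
    h, w = a.shape
    return sum(padded[i:i + h, j:j + w] for i in range(3) for j in range(3))


def harris_response(img, k=0.05):
    """Harris corner response computed from the windowed gradient products."""
    iy, ix = np.gradient(img.astype(float))   # derivatives along rows and columns
    sxx = box_sum3(ix * ix)
    syy = box_sum3(iy * iy)
    sxy = box_sum3(ix * iy)
    det = sxx * syy - sxy ** 2                 # product of the two eigenvalues
    trace = sxx + syy                          # sum of the two eigenvalues
    return det - k * trace ** 2


# Synthetic image: a bright square in the lower-right part, with a corner at (4, 4).
img = np.zeros((9, 9))
img[4:, 4:] = 1.0
response = harris_response(img)
print(response[4, 4], response[7, 4], response[1, 1])
```

The response is positive at the corner (both eigenvalues large), negative along the straight edge (one large eigenvalue) and zero in the flat region.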
Mid Level Features
SIFT – Scale Invariant Feature Transform [1999]
It is an algorithm which is able to detect and describe features in an image. In particular, it is able to do this at different scales and rotations.
It has different steps, which involve scaling the image and computing the Difference of Gaussians (DoG).
High Level Features
HOG – Histogram of Oriented Gradients [2005]
The technique counts occurrences of gradient orientation in localized portions of an image. In the HOG feature descriptor, the distributions (histograms) of the directions of gradients (oriented gradients) are used as features. Gradients (x and y derivatives) of an image are useful because the magnitude of the gradients is large around edges and corners (regions of abrupt intensity changes), and we know that edges and corners pack in a lot more information about object shape than flat regions.
https://www.learnopencv.com/histogram-of-oriented-gradients/
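A simplified sketch of the core HOG step: for one cell of the image, compute the gradients and accumulate a histogram of their orientations, weighted by magnitude. The 9 bins over 0–180 degrees follow common practice and are an assumption here, as are the function name and the test images. Block normalization, which the full descriptor also uses, is left out.

```python
import numpy as np


def cell_histogram(cell, n_bins=9):
    """Histogram of unsigned gradient orientations, weighted by gradient magnitude."""
    gy, gx = np.gradient(cell.astype(float))           # derivatives along rows and columns
    magnitude = np.hypot(gx, gy)
    orientation = np.degrees(np.arctan2(gy, gx)) % 180.0
    hist, _ = np.histogram(orientation, bins=n_bins, range=(0.0, 180.0), weights=magnitude)
    return hist


# Hypothetical 8x8 cells: intensity increasing along x, then along y.
ramp_x = np.tile(np.arange(8.0), (8, 1))
ramp_y = ramp_x.T
print(cell_histogram(ramp_x))
print(cell_histogram(ramp_y))
```

A horizontal intensity ramp puts all the weight in the first bin (orientation near 0 degrees), a vertical ramp in the middle bin (orientation near 90 degrees).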
Classifier
• The classifier needs a phase of training (or learning)
• In this phase the classifier "learns" how to distinguish between the output classes
• This training phase needs a dataset where each image is labelled with the corresponding class name
[Figure: training labels → training → classifier]
Choice of classifier
[Figure: two classes, DOGS and CATS]
The classifiers which learn from a labelled dataset are called supervised, and they can be divided into 3 major categories:
1. Linear (with training)
2. Trees (with training)
3. Based on distances (without training)
Manual classifier - example
Suppose we want to recognize the type of a bottle by looking at it from above. The diameter of the neck determines the type of bottle.
Pipeline: image → feature extraction → features → manual classifier → class
[Figure: the Hough transform is used to measure the neck diameter of two bottles, 𝑑1 and 𝑑2, corresponding to bottle types 1 and 2]
IF d_in > th1 THEN
    bottle_type = 1
ELSE
    bottle_type = 2
END
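The manual rule above can be written as a one-line function. The sketch assumes the neck diameter has already been measured (for example by a Hough transform); the threshold value and the units used in the example calls are hypothetical.

```python
def classify_bottle(d_in, th1):
    """Manual classifier: type 1 if the neck diameter exceeds the threshold, else type 2."""
    return 1 if d_in > th1 else 2


# Hypothetical diameters and threshold, in the same unit (e.g. pixels).
print(classify_bottle(30.0, th1=25.0))
print(classify_bottle(20.0, th1=25.0))
```

No training is involved: the knowledge is entirely in the hand-picked threshold.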
Supervised classifier – distance example
A more complex problem: is there a cat or a dog in the image? Can we still write a manual classifier?
• Dataset: Kaggle Dogs vs. Cats
• Features: 3D RGB histogram
• Classifier: k-Nearest Neighbor (without training)
[Figure: samples of the two classes (cat and dog, drawn as x and o) in the feature plane (x1, x2). A new sample (+) is classified as cat by 1-NN, and as dog by 3-NN and 5-NN]
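A minimal k-nearest-neighbor sketch in two feature dimensions. The training points are hypothetical and were chosen only so that the result changes with k in the same way as in the figure (cat for 1-NN, dog for 3-NN and 5-NN). Real features would be the histogram values of each image.

```python
import math
from collections import Counter


def knn_predict(train, query, k):
    """Return the majority label among the k training points closest to the query."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical labelled samples in the (x1, x2) feature plane.
train = [
    ((0.1, 0.0), "cat"), ((1.0, 1.0), "cat"), ((1.2, 0.9), "cat"),
    ((0.5, 0.0), "dog"), ((0.0, 0.6), "dog"), ((0.7, 0.7), "dog"),
]
query = (0.0, 0.0)
for k in (1, 3, 5):
    print(k, knn_predict(train, query, k))
```

There is no training phase: the "model" is the stored dataset, and all the work happens at prediction time.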
Supervised classifier - linear
There are various linear supervised classifiers, including:
• Logistic regression
• SVM
• Neural Network
Dataset: Kaggle Dogs vs. Cats. Features: 3D RGB histogram.
[Figure: training uses the features and the labels to fit the parameters of the classifier; the trained classifier then outputs "cat"]
Supervised learning – tree
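There are different classifiers based on trees:
• Decision trees
• Random forest
• Gradient Boosting
Dataset: Kaggle Dogs vs. Cats. Features: 3D RGB histogram.
[Figure: training uses the features and the labels to fit the parameters of the classifier; the trained classifier then outputs "cat"]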
Viola Jones object detector (Haar Cascades)
• First object detection framework to provide competitive object detection rates in real time
• Employs Haar features to characterize the input image. Each feature results in a single value, computed by subtracting the sum of the pixels under the white rectangle from the sum of the pixels under the black rectangle
• Makes use of a bank of AdaBoost classifiers in a cascade fashion. Sub-windows of the image at different scales are passed through a series of classifiers and discarded if they fail at any stage
Viola, Jones: Robust Real-time Object Detection, IJCV 2001
https://vimeo.com/12774628
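The Haar feature value described above can be sketched as follows. The use of an integral image (summed-area table) to obtain rectangle sums is the standard trick in this family of detectors. The function names, the left-white/right-black layout and the test image are assumptions for illustration.

```python
import numpy as np


def integral_image(img):
    """Summed-area table with an extra leading row and column of zeros."""
    ii = np.zeros((img.shape[0] + 1, img.shape[1] + 1), dtype=np.int64)
    ii[1:, 1:] = img.cumsum(axis=0).cumsum(axis=1)
    return ii


def rect_sum(ii, top, left, height, width):
    """Sum of the pixels inside a rectangle, using four table look-ups."""
    bottom, right = top + height, left + width
    return ii[bottom, right] - ii[top, right] - ii[bottom, left] + ii[top, left]


def haar_two_rect(ii, top, left, height, width):
    """Two-rectangle feature: sum under the black (right) half minus the white (left) half."""
    half = width // 2
    white = rect_sum(ii, top, left, height, half)
    black = rect_sum(ii, top, left + half, height, half)
    return black - white


# Hypothetical 4x4 image: dark left half (value 1), brighter right half (value 3).
img = np.array([[1, 1, 3, 3]] * 4)
ii = integral_image(img)
print(haar_two_rect(ii, 0, 0, 4, 4))
```

Each rectangle sum costs four look-ups regardless of its size, which is what makes evaluating many features per window feasible in real time.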
An example – HOG for pedestrian detection
N. Dalal and B. Triggs, "Histograms of oriented gradients for human detection", CVPR 2005
Step 1: scan image at all scales and locations
Step 2: extract features over each sliding window location
Step 3: use linear SVM to classify features extracted from each window
Step 4: apply non-maxima suppression to obtain final bounding boxes
Pros/Cons:
+ Real-time
+ Open source software implementation (Dlib)
+ Higher accuracy than Haar Cascades
- Pre-trained only for frontal faces
- Low accuracy on distant and overlapping faces
- High false-positive rate
- High sensitivity to parameter changes
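Step 4 of the pipeline, non-maxima suppression, can be sketched as a greedy procedure: keep the highest-scoring box, drop every remaining box that overlaps it too much, and repeat. The box format (x1, y1, x2, y2), the IoU threshold of 0.5 and the example boxes are assumptions for illustration.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)


def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """Return the indices of the boxes kept, in order of decreasing score."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    kept = []
    while order:
        best = order.pop(0)
        kept.append(best)
        # discard the remaining boxes that overlap the kept one too much
        order = [i for i in order if iou(boxes[best], boxes[i]) <= iou_threshold]
    return kept


# Hypothetical detections: two overlapping boxes on one object, one box elsewhere.
boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
print(non_max_suppression(boxes, scores))
```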
Where are we?
PASCAL VOC Challenge
It is an online competition for object detection (classification and localization of objects)
• 20 classes
• 11,530 images
• 27,450 objects
Where are we?
• Plateau of results between 2011 and 2012
• The 2012 results were achieved using a combination of HOG+LBP features and an ensemble of classifiers: "Boosted Local Structured HOG-LBP for Object Localization", Junge Zhang, Kaiqi Huang, Yinan Yu and Tieniu Tan, 2012
PASCAL Visual Object Classes:
• Train/validation/test: 9,963 images containing 24,640 annotated objects
• 20 classes of objects
Convolutional Neural Network (CNN)
[Figure: with a CNN, both the features and the classifier are learned. Training: train images + training labels → training → learned model (learned features + learned classifier). Prediction: test image(s) → learned model → "It's an apple"]
Convolutional Neural Network
In deep learning, a convolutional neural network (CNN, or ConvNet) is a class of deep neural networks, most commonly applied to analysing visual imagery.
Convolutional networks were inspired by biological processes in that the connectivity pattern between neurons resembles the organization of the animal visual cortex. Individual cortical neurons respond to stimuli only in a restricted region of the visual field known as the receptive field.
CNNs use relatively little pre-processing compared to other image classification algorithms. This means that the network learns the filters that in traditional algorithms were hand-engineered.
A CNN consists of an input and an output layer, as well as multiple hidden layers. The hidden layers of a CNN typically consist of convolutional layers, RELU layer i.e. activation function, pooling layers, fully connected layers and normalization layers.
https://en.wikipedia.org/wiki/Convolutional_neural_network
CNN - AlexNet
Input image → Conv 1 → Conv 2 → Conv 3 → Conv 4 → Conv 5 → FC 6 → FC 7 → FC 8 → Output
Convolutional Layer
Recap
Convolution
Convolutional Layer
Input image: 32x32x3 (height, width, depth)
Filter: 5x5x3
1. Convolve the filter with the image ("slide over the image spatially")
2. The filter should have the same depth as the previous layer (in this case 3)
3. Convolution preserves spatial structure
Convolutional Layer
Input image: 32x32x3
Filter: 5x5x3
Slide (convolve) the filter over all spatial locations. Each position yields 1 number: the convolution result, i.e. the dot product between the filter and a small 5x5x3 chunk of the image.
The result is a 28x28x1 activation map.
A second 5x5x3 filter (filter #2), slid over the same input, produces a second 28x28x1 activation map.
If we have 4 filters, we'll have 4 activation maps: we obtain a new 28x28x4 "image".
Convolutional Layer
In the previous step, the convolution reduced the dimension from 32 to 28.
Example: a 7x7 image with a 3x3 filter produces a 5x5 output.
To keep the spatial size (which is what makes very deep neural networks possible), we can add a frame around the image, called padding.
Convolutional Layer
A 9x9 input (7x7 image + 1 frame of zero padding) with a 3x3 filter produces a 7x7 output.
The size of the padding depends on the size of the filter.
If we want to reduce the size of the output of the convolution layer, we can use the stride parameter.
Convolutional Layer
A 9x9 input (7x7 image + 1 frame of padding, N = 9) with a 3x3 filter (F = 3) and stride 2 produces a 4x4 output.
In general:
$$\text{Output} = \frac{N - F}{\text{stride}} + 1$$
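The output-size rule can be checked with a few lines of code. In this sketch the padding is passed separately and added on both sides, so N is the size of the unpadded image. The function name and the error raised when the filter does not fit evenly are assumptions.

```python
def conv_output_size(n, f, stride=1, padding=0):
    """Spatial output size of a convolution: (N + 2P - F) / stride + 1."""
    span = n + 2 * padding - f
    if span < 0 or span % stride != 0:
        raise ValueError("filter does not fit the input evenly with this stride")
    return span // stride + 1


print(conv_output_size(7, 3))                        # 7x7 image, 3x3 filter
print(conv_output_size(7, 3, padding=1))             # with one frame of padding
print(conv_output_size(7, 3, stride=2, padding=1))   # padding 1 and stride 2
print(conv_output_size(32, 5))                       # 32x32 image, 5x5 filter
```

The four calls correspond to the examples on the slides: 5, 7, 4 and 28.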
Convolutional Layer
The Conv Layer:
• Accepts a volume of size W1 x H1 x D1
• Requires 4 hyperparameters:
  • Number of filters K
  • Their spatial extent F
  • The amount of zero padding P
  • The stride S
• Produces a volume of size W2 x H2 x D2 where:
  • $W_2 = \frac{W_1 - F + 2P}{S} + 1$
  • $H_2 = \frac{H_1 - F + 2P}{S} + 1$
  • $D_2 = K$
• It introduces $F \cdot F \cdot D_1$ weights per filter, for a total of $(F \cdot F \cdot D_1) \cdot K$ weights (10 filters of size 5x5 on an RGB image have $5 \cdot 5 \cdot 3 \cdot 10 = 750$ weights, not counting the K biases)
Common settings:
• K = powers of 2 (e.g. 32, 64, 128, 512)
• F = 3, S = 1, P = 1
• F = 5, S = 1, P = 2
• F = 5, S = 2, P = ? (whatever fits)
• F = 1, S = 1, P = 0
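As a quick check of the formulas above, this sketch computes the output volume and the number of weights of a conv layer from its hyperparameters. The function name is hypothetical, and the bias count (one per filter) is reported separately from the weights.

```python
def conv_layer_output(w1, h1, d1, k, f, s, p):
    """Return the output volume (W2, H2, D2), the weight count and the bias count."""
    w2 = (w1 - f + 2 * p) // s + 1
    h2 = (h1 - f + 2 * p) // s + 1
    weights = f * f * d1 * k   # F*F*D1 weights per filter, K filters
    biases = k                 # one bias per filter
    return (w2, h2, k), weights, biases


# 32x32 RGB image, 10 filters of size 5x5, stride 1, padding 2.
print(conv_layer_output(32, 32, 3, k=10, f=5, s=1, p=2))
```

With F = 5, S = 1 and P = 2 the spatial size is preserved (32x32), and the layer has 750 weights as in the example above.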
Brain / Convolutional Layer
Input: 32x32x3. Filter: 5x5x3. Convolution result: a 28x28x1 activation map.
Each convolution result is like a single neuron with local connectivity.
An activation map is a 28x28 sheet of neuron outputs:
1. Each is connected to a small region in the input
2. All of them share parameters
"5x5 filter" -> "5x5 receptive field for each neuron"
Brain / Convolutional Layer
Using 5 filters on the 32x32x3 input, we stack the neurons in a 28x28x5 volume.
This means that 5 different neurons are looking at the same piece of the image, each producing its own output.
Activation layer
[Figure: a neuron. The weighted sum of the inputs is done by convolution; the non-linearity is done by the activation function]
Activation layer
• Sigmoid: $y(x) = \frac{1}{1 + e^{-x}}$
• Hyperbolic tangent: $y(x) = \tanh(x)$
• ReLU: $y(x) = \max(0, x)$
• Leaky ReLU: $y(x) = \max(0.1x, x)$
• ELU: $y(x) = \begin{cases} x, & x \ge 0 \\ \alpha (e^{x} - 1), & x < 0 \end{cases}$
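The activation functions listed above are one-liners in NumPy. This is a sketch only: the default α = 1.0 for ELU is an assumption, and the leaky slope 0.1 is the value given on the slide.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def relu(x):
    return np.maximum(0.0, x)


def leaky_relu(x, slope=0.1):
    return np.maximum(slope * x, x)


def elu(x, alpha=1.0):
    # expm1 is applied only to the non-positive part to avoid overflow for large x
    return np.where(x >= 0, x, alpha * np.expm1(np.minimum(x, 0.0)))


x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
for fn in (sigmoid, np.tanh, relu, leaky_relu, elu):
    print(fn.__name__, fn(x))
```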
Pooling layer
This layer reduces the spatial size of the representation of the image. It aims to reduce the amount of parameters and computation in the network, and hence also to control overfitting.
Example: MAX POOLING with a 2x2 filter and a stride of 2.
The idea is to reduce the size while keeping the most activated neuron of each region.
Pooling layer
• Accepts a volume of size W1 x H1 x D1
• Requires two hyperparameters:
  • the spatial extent F
  • the stride S
• Produces a volume of size W2 x H2 x D2 where:
  • $W_2 = \frac{W_1 - F}{S} + 1$
  • $H_2 = \frac{H_1 - F}{S} + 1$
  • $D_2 = D_1$
• Introduces zero parameters, since it computes a fixed function of the input
• Different versions of pooling exist. The most used is MAX pooling, followed by AVG pooling
• The same size reduction can also be obtained through a Conv layer with a larger stride
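A minimal max-pooling sketch for a single channel, using the 2x2 filter and stride 2 mentioned above as defaults. The function name and the example input are assumptions for illustration.

```python
import numpy as np


def max_pool(x, f=2, s=2):
    """Max pooling on a 2D array: output size is (W1 - F) / S + 1 per dimension."""
    out_h = (x.shape[0] - f) // s + 1
    out_w = (x.shape[1] - f) // s + 1
    out = np.empty((out_h, out_w), dtype=x.dtype)
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = x[i * s:i * s + f, j * s:j * s + f].max()
    return out


# Hypothetical 4x4 activation map.
x = np.array([
    [1, 1, 2, 4],
    [5, 6, 7, 8],
    [3, 2, 1, 0],
    [1, 2, 3, 4],
])
print(max_pool(x))
```

The 4x4 input becomes 2x2, and the function has no learnable parameters.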
Fully connected layer
These layers are just like the classic neural network layers.
All the inputs are connected to the outputs
Fully connected layer
In CNNs they have the task of "classifying" the high-level features extracted by the previous layers.
Example: a 32x32x3 image is stretched to a 3072x1 vector. The output size can be the number of classes (e.g. 10x1), which requires a weight matrix W of size 10x3072.
This is the layer involved in the concept of "Transfer Learning" (more on this later).
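The fully connected example above amounts to a matrix-vector product. In this sketch the image and the weights are random placeholders (an assumption, since no real values are given), and a bias vector is added as is usual for such layers.

```python
import numpy as np

rng = np.random.default_rng(0)

image = rng.random((32, 32, 3))             # placeholder 32x32x3 image
x = image.reshape(-1)                       # stretched to a 3072-element vector

W = rng.standard_normal((10, 3072)) * 0.01  # one row of weights per class
b = np.zeros(10)                            # one bias per class

scores = W @ x + b                          # 10 class scores
print(x.shape, W.shape, scores.shape)
```

Every output is connected to every input, which is why this single layer already needs 10 x 3072 = 30,720 weights.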
CNN – AlexNet – Recap
Input image
→ Conv 1: 11x11x3x96, stride 4
→ Conv 2: 5x5x96x256
→ Conv 3: 3x3x256x384
→ Conv 4: 3x3x384x384
→ Conv 5: 3x3x384x256
→ FC 6 → FC 7 → FC 8
→ Output
CNN – AlexNet – Recap
Figure credit: Zeiler and Fergus, “Visualizing and Understanding Convolutional Networks”, ECCV 2014
The learned features (and filters) are more complex than the hand-crafted ones.
Hierarchical features
Other networks
GoogLeNet. The ILSVRC 2014 winner. Its main contribution was the development of an Inception Module that dramatically reduced the number of parameters in the network (4M, compared to AlexNet with 60M). Additionally, it uses Average Pooling instead of Fully Connected layers at the top of the ConvNet, eliminating a large amount of parameters.
VGGNet. The runner-up in ILSVRC 2014. Its main contribution was in showing that the depth of the network is a critical component for good performance. Its final best network contains 16 CONV/FC layers and, appealingly, features an extremely homogeneous architecture that only performs 3x3 convolutions and 2x2 pooling from the beginning to the end.
ResNet. Residual Network was the winner of ILSVRC 2015. It features special skip connections and a heavy use of batch normalization. The architecture is also missing fully connected layers at the end of the network.
Architecture   #Params  Size    Accuracy  Year  #OPS   FW time [GPU]  FW time [CPU]
AlexNet        61M      238 MB  80.2      2012  724M   3.1 ms         0.29 s
Inception V1   7M       70 MB   88.3      2014  1.43B  -              -
VGGNet         138M     528 MB  91.2      2014  15.5B  9.4 ms         4.36 s
ResNet-50      25.5M    99 MB   93        2015  3.9B   11 ms          1.13 s
GPU: Titan X | CPU: i7-4790K (4 GHz)
Training Process
1. Create your model
2. Choose the Activation Functions (use ReLU)
3. Data Preprocessing (images: subtract mean)
4. Weight Initialization
5. Use Batch Normalization
6. Babysitting the Learning process
7. Hyperparameter Optimization
An Object Detection Pipeline
How to use a CNN on your data
AIM: being able to reuse a CNN for object detection on our own data
1. REUSE A CNN
2. OBJECT DETECTION
3. OWN DATA
Data gathering
Suppose we want to create CV software which is able to detect playing cards (for simplicity only 9, 10, Jack, Queen, King and Ace).
Since the algorithm is supervised, we need a dataset with the associated labels.
The dataset must be:
1. As large as possible (at least 200 items per class)
2. With the same object in different "conditions" (background, lights)
3. With random objects along with the desired object
4. Representative of the "application conditions", so decide whether we need partial objects, overlapping objects and so on
5. With no label errors
6. Not too large (less training time)
You can create the dataset by taking pictures (e.g. with a smartphone) or by harvesting images from Google Images.
Data gathering
Labelling
We need to put an annotation on each image, to tell the CNN what is in the image.
We are building an object detector, so the labelling process involves the creation of bounding boxes (in the example image: "10", "King", "King").
1. This process is how we transfer our knowledge to the data
2. We can use open-source software (LabelImg → https://github.com/tzutalin/labelImg)
3. It creates an XML file associated with the image:
<object>
<name>ten</name>
<pose>Unspecified</pose>
<truncated>0</truncated>
<difficult>0</difficult>
<bndbox>
<xmin>145</xmin>
<ymin>68</ymin>
<xmax>303</xmax>
<ymax>225</ymax>
</bndbox>
</object>
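A short sketch of how such an annotation could be read back in Python with the standard `xml.etree.ElementTree` module. It parses only the `<object>` fragment shown above; in a complete annotation file the objects are nested under a root element, which is not shown on the slide. The function name `parse_object` is hypothetical.

```python
import xml.etree.ElementTree as ET

# The <object> fragment shown on the slide.
FRAGMENT = """
<object>
  <name>ten</name>
  <pose>Unspecified</pose>
  <truncated>0</truncated>
  <difficult>0</difficult>
  <bndbox>
    <xmin>145</xmin>
    <ymin>68</ymin>
    <xmax>303</xmax>
    <ymax>225</ymax>
  </bndbox>
</object>
"""


def parse_object(obj):
    """Return (class name, (xmin, ymin, xmax, ymax)) for one <object> element."""
    name = obj.findtext("name")
    box = obj.find("bndbox")
    coords = tuple(int(box.findtext(tag)) for tag in ("xmin", "ymin", "xmax", "ymax"))
    return name, coords


label, (xmin, ymin, xmax, ymax) = parse_object(ET.fromstring(FRAGMENT))
print(label, xmax - xmin, ymax - ymin)  # class name, box width, box height
```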
Object detection
Our dataset is ready; we now need to find a model for object detection.
So far we have learned how to do image classification. We can use the same models and slide a window over the image.
CONS: the window needs a different shape in different positions, and the process takes a huge amount of time.
Object detection – Classification based
Region proposal: run the CNN only on the parts of the image which may contain an object
R-CNN
Further improvements led to the creation of:
• Fast-RCNN
• Faster-RCNN
They are also called two-stage algorithms
Girshick et al, "Rich feature hierarchies for accurate object detection and semantic segmentation", CVPR 2014
Object detection – Regression based
NO region proposal: run the CNN on "pre-defined" areas
• Divide the image into a 7 x 7 grid
• Imagine you have B base boxes for each grid cell
• The CNN regresses from the base boxes to a new tensor (dx, dy, dh, dw, confidence)
• The output of these networks is a big tensor of size 7 x 7 x (B * (5 + C)), where C is the number of classes
Object detection – Regression based
NO region proposal
• Classes: C = 2 [cat, dog]
• Boxes: B = 3
• Output: 7x7x21
• Example box vector: (confidence[0.85], bx, by, bh, bw, cat[0], dog[1])
The most famous structures are:
• YOLO (You Only Look Once)
• SSD (Single Shot Detection)
They are also called one-stage algorithms, and they have been built specifically for object detection.
They are faster than the other methods, but also less accurate.
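The size of the output tensor, and the way one box vector is read, can be sketched as follows. The function names are hypothetical, and the numeric box coordinates in the example are placeholders; only the layout (confidence, bx, by, bh, bw, then one score per class) comes from the slide.

```python
def output_shape(grid=7, boxes=3, classes=2):
    """Shape of the detector output: grid x grid x (B * (5 + C))."""
    return (grid, grid, boxes * (5 + classes))


def read_box_vector(vector, class_names):
    """Split one (confidence, bx, by, bh, bw, class scores...) vector into named parts."""
    confidence, bx, by, bh, bw = vector[:5]
    class_scores = dict(zip(class_names, vector[5:]))
    label = max(class_scores, key=class_scores.get)
    return {"confidence": confidence, "box": (bx, by, bh, bw), "label": label}


print(output_shape())  # C = 2 classes, B = 3 boxes per cell
# Placeholder box coordinates; confidence and class scores as in the slide example.
print(read_box_vector([0.85, 0.5, 0.5, 0.2, 0.3, 0, 1], ["cat", "dog"]))
```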
Object detection
Once we have decided on the model, we can download the structure and the weights.
https://github.com/tensorflow/models/blob/master/research/object_detection/g3doc/detection_model_zoo.md
These models already have the structure of the CNN implemented.
The weights we download, however, are trained on some other dataset (COCO, Pascal, Kitti…)
How can we use these huge, already trained networks for our problem?
The solution is transfer learning
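Transfer learning
Training a CNN from scratch requires a very large dataset and very expensive hardware.
Transfer learning re-uses the ability learnt in another task:
• Start from a network that is already trained (e.g. pre-trained on ImageNet)
• Small dataset: freeze most of the layers and retrain only the last ones, with the output resized to the new number of classes
• Big dataset: freeze fewer layers and retrain a larger part of the network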
Object detection pipeline
1. Create your own dataset
2. Choose your network structure (Faster-RCNN / SSD ...)
3. Modify the fully connected layers
4. Retrain with transfer learning
5. Detect your objects
Object detection examples
• YOLO – 1 class
• SSD – 5 classes
[Figure: example pipeline with the stages: calibrated cropping; persons detection (R-FCN); face extraction (D-Lib); age/gender estimation (VGG-16 + DC layer). Example outputs: W – (25,35), M – (38,43)]
Object detection examples
Bibliography
• https://cs.stanford.edu/people/karpathy/convnetjs/demo/cifar10.html
• https://www.youtube.com/playlist?list=PL3FW7Lu3i5JvHM8ljYj-zLfQRF3EO8sYv
• http://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf
• https://github.com/EdjeElectronics/TensorFlow-Object-Detection-API-Tutorial-Train-Multiple-Objects-Windows-10
• https://project.inria.fr/deeplearning/files/2016/05/DLFrameworks.pdf [for an in-depth comparison of DL frameworks]