Page 1
Convolutional Patch Representations for Image Retrieval: an Unsupervised Approach
29th Mar 2016
Original slides by Eva Mohedano, Insight Centre for Data Analytics (Dublin City University)
Mattis Paulin, Julien Mairal, Matthijs Douze, Zaid Harchaoui, Florent Perronnin, Cordelia Schmid
Page 2
Overview
Published at ICCV 2015 (a.k.a. “Local Convolutional Features With Unsupervised Training for Image Retrieval”)
Deep convolutional architecture that produces patch-level descriptors
• Unsupervised framework
• Comparison on patch and image retrieval datasets
• New “RomePatches” dataset
Page 3
Related Work
• Shallow patch descriptors
• Deep learning for image retrieval
• Deep patch descriptors
Page 4
Related Work
• Shallow patch descriptors
SIFT – Scale-Invariant Feature Transform
- stereo matching
- retrieval
- classification
SURF, BRIEF, LIOP, (…)
Hand-crafted → relatively small number of parameters.
Note: a patch is a region extracted from an image.
Page 5
Related Work
• Deep learning for image retrieval
A CNN trained on a sufficiently large labeled dataset (ImageNet) produces intermediate-layer activations that can be used as image descriptors.
These descriptors work for a wide variety of tasks, including image retrieval.
Page 6
Related Work
• Deep learning for image retrieval
source image: http://pubs.sciepub.com/ajme/2/7/9/
Page 7
Related Work
• Deep learning for image retrieval
source image: http://pubs.sciepub.com/ajme/2/7/9/
Fully connected layers → global image descriptors
● Compact representation
● Lack of geometric invariance
→ Below the state of the art in image retrieval
→ Compute at different scales (Babenko, Razavian)
Page 8
Related Work
• Deep learning for image retrieval
source image: http://pubs.sciepub.com/ajme/2/7/9/
Convolutional layers
Page 9
Related Work
• Deep patch descriptors
Three different kinds of supervision:
1. Category labels of ImageNet [Long et al., 2014]
2. Surrogate patch labels: each class is a given patch under different transformations [Fischer et al., 2014]
3. Matching/non-matching pairs [Simo-Serra et al., 2015]
These works focus on patch-level metrics, not image retrieval.
All approaches require some kind of supervision.
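To make the surrogate-label idea (option 2) concrete, here is a minimal sketch of how such a training set could be built: every seed patch defines its own class, and the members of that class are randomly transformed copies of it. The function name `make_surrogate_dataset` and the specific transformations (random crop, horizontal flip, contrast scaling) are illustrative assumptions, not the augmentation set used by Fischer et al.

```python
import numpy as np


def make_surrogate_dataset(seed_patches, n_per_class, out_size=32, rng=None):
    """Surrogate classification set: class k = random transformations of seed patch k.

    The transformations below (random crop, horizontal flip, contrast scaling)
    are illustrative assumptions, not the exact augmentations of Fischer et al.
    """
    rng = np.random.default_rng() if rng is None else rng
    images, labels = [], []
    for label, patch in enumerate(seed_patches):
        h, w = patch.shape
        for _ in range(n_per_class):
            # Random translation: crop an out_size x out_size window at a random offset.
            top = rng.integers(0, h - out_size + 1)
            left = rng.integers(0, w - out_size + 1)
            view = patch[top:top + out_size, left:left + out_size].astype(float)
            # Random horizontal flip.
            if rng.random() < 0.5:
                view = view[:, ::-1]
            # Random contrast change around the patch mean (the mean is preserved).
            mean = view.mean()
            view = (view - mean) * rng.uniform(0.7, 1.3) + mean
            images.append(view)
            labels.append(label)
    return np.stack(images), np.array(labels)
```

A network trained to classify these images learns features that are stable under the chosen transformations, without any human labels.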
Page 10
Image Retrieval Pipeline
• Interest point detection
Hessian-Affine detector.
Rotation invariance.
• Interest point description
Feature representation in a Euclidean space
• Patch Matching
VLAD encoding.
Power normalization with exponent 0.5, followed by L2 normalization.
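A minimal NumPy sketch of the encoding step: VLAD with hard assignment, then power normalization with exponent 0.5 (a signed square root) and L2 normalization. It assumes the codebook `centroids` was learned beforehand (for example with k-means on training descriptors); the function name `vlad` is ours.

```python
import numpy as np


def vlad(descriptors, centroids, alpha=0.5):
    """VLAD encoding of local descriptors (n, d) against a codebook (k, d).

    Returns a k*d vector after power normalization (exponent alpha) and L2 normalization.
    """
    # Hard-assign each descriptor to its nearest centroid (squared Euclidean distance).
    dists = ((descriptors[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    assign = dists.argmin(axis=1)
    k, d = centroids.shape
    v = np.zeros((k, d))
    for c in range(k):
        members = descriptors[assign == c]
        if len(members):
            # Accumulate the residuals of the descriptors assigned to centroid c.
            v[c] = (members - centroids[c]).sum(axis=0)
    v = v.ravel()
    # Power normalization: signed element-wise |x|^alpha (alpha = 0.5 gives the signed square root).
    v = np.sign(v) * np.abs(v) ** alpha
    # L2 normalization (an all-zero vector is returned unchanged).
    norm = np.linalg.norm(v)
    return v / norm if norm > 0 else v
```

Two images can then be compared with the dot product of their VLAD vectors, which for L2-normalized vectors is their cosine similarity.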
Page 12
Convolutional Descriptors
Patch size = 51×51, found optimal for SIFT on the Oxford dataset.
CNNs can be extended to retrieval by:
• Encoding local descriptors with a model trained on an unrelated classification task
• Devising a surrogate classification problem that is as closely related as possible to image retrieval
• Using unsupervised learning: Convolutional Kernel Network (CKN)
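Whatever the descriptor, its input is a fixed-size patch (51×51 here) resampled around each detected keypoint. The sketch below normalizes only scale and dominant orientation, using nearest-neighbour sampling; the actual Hessian-Affine pipeline also normalizes the affine shape. The function name `extract_patch` and its parameters (`scale`, `angle`) are hypothetical.

```python
import numpy as np


def extract_patch(image, x, y, scale=1.0, angle=0.0, size=51):
    """Resample a size x size patch centred at (x, y), rotated by `angle` (radians)
    and scaled by `scale` image pixels per patch pixel, with nearest-neighbour sampling.
    Patch pixels that fall outside the image are set to 0.
    """
    half = (size - 1) / 2.0
    # Patch grid offsets from the centre: u along columns (x), v along rows (y).
    u, v = np.meshgrid(np.arange(size) - half, np.arange(size) - half)
    c, s = np.cos(angle), np.sin(angle)
    # Rotate and scale the grid, then translate it to the keypoint location.
    src_x = x + scale * (c * u - s * v)
    src_y = y + scale * (s * u + c * v)
    xi = np.rint(src_x).astype(int)
    yi = np.rint(src_y).astype(int)
    h, w = image.shape
    inside = (xi >= 0) & (xi < w) & (yi >= 0) & (yi < h)
    patch = np.zeros((size, size), dtype=image.dtype)
    patch[inside] = image[yi[inside], xi[inside]]
    return patch
```

With `scale=1.0` and `angle=0.0` this reduces to a plain 51×51 crop around the keypoint.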
Page 13
Convolutional Descriptors
• Using unsupervised learning: Convolutional Kernel Network
Feature representation based on a kernel (feature) map; the kernel itself is data-independent.
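For reference, the single-layer kernel of the original CKN paper (Mairal et al., NIPS 2014), as we understand it, compares two feature maps by summing over all pairs of positions z, z' the term ||φ(z)|| ||φ'(z')|| exp(-||z - z'||²/(2β²)) exp(-||φ̃(z) - φ̃'(z')||²/(2σ²)), where φ̃ denotes L2-normalized features. Nothing in this definition depends on training data. The sketch evaluates it exactly, at a cost quadratic in the number of positions; `beta`, `sigma` and the function name are illustrative choices, and both maps are assumed to have the same spatial size.

```python
import numpy as np


def ckn_kernel(phi1, phi2, beta=1.0, sigma=0.5, eps=1e-12):
    """Exact single-layer CKN-style kernel between two feature maps of shape (H, W, d)."""
    assert phi1.shape == phi2.shape
    H, W, d = phi1.shape
    # Pixel coordinates z in row-major order, matching the reshape of the features.
    coords = np.stack(np.meshgrid(np.arange(H), np.arange(W), indexing="ij"), axis=-1).reshape(-1, 2)
    f1 = phi1.reshape(-1, d)
    f2 = phi2.reshape(-1, d)
    n1 = np.linalg.norm(f1, axis=1)
    n2 = np.linalg.norm(f2, axis=1)
    # L2-normalized features (eps avoids division by zero for empty positions).
    u1 = f1 / np.maximum(n1, eps)[:, None]
    u2 = f2 / np.maximum(n2, eps)[:, None]
    # Squared distances between positions and between normalized features.
    dz = ((coords[:, None, :] - coords[None, :, :]) ** 2).sum(-1)
    df = ((u1[:, None, :] - u2[None, :, :]) ** 2).sum(-1)
    # Norm weights times the two Gaussians, summed over all position pairs.
    weights = n1[:, None] * n2[None, :]
    return float((weights * np.exp(-dz / (2 * beta**2)) * np.exp(-df / (2 * sigma**2))).sum())
```

Because of the norm weights, scaling one input scales the kernel linearly, while the normalized-feature term only looks at the shape of the local responses.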
Page 14
Convolutional Descriptors
• Using unsupervised learning: Convolutional Kernel Network
The kernel implicitly projects patches into a Hilbert space.
For computational efficiency, an explicit kernel map approximating it can be computed, using:
- A sub-sample of patches
- Stochastic gradient optimization
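A simplified sketch of such an explicit approximation, assuming unit-norm inputs as in CKN: the Gaussian kernel exp(-||x - y||²/(2σ²)) is approximated by Σ_l η_l exp(-||x - w_l||²/σ²) exp(-||y - w_l||²/σ²), so ψ(x)_l = sqrt(η_l) exp(-||x - w_l||²/σ²) is a finite-dimensional map. For brevity, the anchors w_l are fixed to a random sub-sample of training points and only the weights η_l ≥ 0 are fitted, with projected mini-batch SGD; CKN also optimizes the anchors. All names and hyperparameters here are hypothetical.

```python
import numpy as np


def gaussian_kernel(x, y, sigma):
    # Exact kernel exp(-||x - y||^2 / (2 sigma^2)) for row-wise pairs (x_i, y_i).
    return np.exp(-((x - y) ** 2).sum(-1) / (2 * sigma**2))


def anchor_responses(x, anchors, sigma):
    # a_l(x) = exp(-||x - w_l||^2 / sigma^2) for every row of x and anchor w_l; shape (n, p).
    return np.exp(-((x[:, None, :] - anchors[None, :, :]) ** 2).sum(-1) / sigma**2)


def fit_weights(x, y, anchors, sigma, lr=0.01, iters=2000, batch=100, rng=None):
    """Projected mini-batch SGD on eta >= 0 for
    mean_i (k(x_i, y_i) - sum_l eta_l a_l(x_i) a_l(y_i))^2, with the anchors kept fixed."""
    rng = np.random.default_rng() if rng is None else rng
    eta = np.zeros(len(anchors))
    for _ in range(iters):
        idx = rng.integers(0, len(x), size=batch)
        feats = anchor_responses(x[idx], anchors, sigma) * anchor_responses(y[idx], anchors, sigma)
        residual = gaussian_kernel(x[idx], y[idx], sigma) - feats @ eta
        grad = -2.0 * feats.T @ residual / batch
        eta = np.maximum(eta - lr * grad, 0.0)  # gradient step, then project onto eta >= 0
    return eta


def explicit_map(x, anchors, eta, sigma):
    # psi(x)_l = sqrt(eta_l) * a_l(x), so <psi(x), psi(y)> approximates the Gaussian kernel.
    return np.sqrt(eta) * anchor_responses(x, anchors, sigma)
```

The resulting descriptor has one dimension per anchor, and dot products between descriptors replace kernel evaluations.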
Page 15
Convolutional Descriptors
• Using unsupervised learning: Convolutional Kernel Network
Four possible inputs
From left to right: CKN-raw, CKN-mean subs (mean subtraction), CKN-white (mean subtraction + PCA whitening), CKN-grad (gradients, fully invariant to color)
Only CKN-raw, CKN-white and CKN-grad are evaluated.
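Minimal sketches of these input transformations, under our own assumptions about where they are applied: mean subtraction per patch (CKN-mean subs), PCA whitening fitted on a set of training vectors (CKN-white), and a two-channel gradient map of a single-channel patch (CKN-grad). The paper may apply them to sub-patches rather than whole patches; all function names are hypothetical.

```python
import numpy as np


def mean_subtract(patches):
    # Flatten each patch (n, h, w) -> (n, h*w) and remove its own mean intensity.
    flat = patches.reshape(len(patches), -1)
    return flat - flat.mean(axis=1, keepdims=True)


def fit_pca_whitening(X, eps=1e-5):
    # Learn a PCA-whitening transform (mean, projection) from training vectors X (n, d).
    mu = X.mean(axis=0)
    cov = np.cov(X - mu, rowvar=False)
    evals, evecs = np.linalg.eigh(cov)
    # Scale each principal direction by 1 / sqrt(eigenvalue).
    W = evecs / np.sqrt(evals + eps)
    return mu, W


def whiten(X, mu, W):
    return (X - mu) @ W


def gradient_map(patch):
    # Per-pixel gradient of a single-channel patch, stacked as (dx, dy) channels.
    gy, gx = np.gradient(patch.astype(float))
    return np.stack([gx, gy], axis=-1)
```

For CKN-white, the two steps chain as `whiten(M, *fit_pca_whitening(M))` with `M = mean_subtract(patches)`, where the whitening would normally be fitted on training patches only.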
Page 16
Experiments
Datasets:
1. RomePatches-Image
2. Oxford
3. UKbench and Holidays
CKN trained on 1M sub-patches, for 300K iterations, with mini-batches of size 1000.
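A quick sanity check on this training budget, assuming each mini-batch draws 1000 items from the pool of 1M sub-patches: 300K iterations then process about 3×10⁸ items, i.e. each sub-patch is visited roughly 300 times on average.

```python
n_subpatches = 1_000_000   # size of the training pool
iterations = 300_000       # SGD iterations
batch_size = 1_000         # items per mini-batch (assumed drawn from the pool)

samples_seen = iterations * batch_size      # total items processed
passes = samples_seen / n_subpatches        # average visits per sub-patch
print(samples_seen, passes)  # 300000000 300.0
```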
Page 18
Conclusions
• CKNs offer similar and sometimes better performance than CNNs in the context of patch description.
• Good patch retrieval translates into good image retrieval.
• CKNs are orders of magnitude faster to train than CNNs (10 min vs. 2-3 days on a modern GPU).
• Fully unsupervised: no labels.
Page 19
Resources
RomePatches + code (although the code is not accessible!)
Discriminative Unsupervised Feature Learning with Exemplar Convolutional Neural Networks
- Code with augmentations in MATLAB
- Code for training models
- Models already trained :-)
Triplet net + code!!
- Greyscale local patches of 32×32, tested on matching datasets
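The triplet network above learns descriptors from (anchor, positive, negative) patch triplets. A minimal sketch of a standard margin-based triplet loss on precomputed descriptors, assuming Euclidean distance and a margin of 1.0; the exact loss used in that code may differ, and the function name is ours.

```python
import numpy as np


def triplet_margin_loss(anchor, positive, negative, margin=1.0):
    """Mean hinge loss max(0, margin + d(a, p) - d(a, n)) over a batch of (n, d)
    descriptors, with d the Euclidean distance."""
    d_pos = np.linalg.norm(anchor - positive, axis=1)
    d_neg = np.linalg.norm(anchor - negative, axis=1)
    return float(np.maximum(0.0, margin + d_pos - d_neg).mean())
```

The loss is zero once every negative is at least `margin` farther from its anchor than the matching positive, so only violating triplets produce gradients.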