CSE291 - Weakly Supervised Learning
• 1. Not enough labeled data
• 2. Transfer learning
• 3. Helps increase the performance of supervised learning
• 4. Provides good insights into solving certain learning problems
Learning to See by Moving
• 1. Biological background
• 2. Why use egomotion information as supervision:
• Availability of “labeled data”
• 3. Overview:
• Egomotion information as a form of self-supervision
• 4. Main result:
• The learned visual representation compares favourably to one learned with direct supervision on the tasks of scene recognition, object recognition, visual odometry, and keypoint matching
• 1. Correlating visual stimuli with egomotion:
• Egomotion <==> camera motion
• Predicting the camera transformation from consecutive pairs of images
• 2. Visual correspondence can help with visual tasks in general:
• Pretraining for other tasks
• 3. The TCNN is only used during training
Compared to SFA Training
• x_t1, x_t2 refer to the feature representations of frames observed at times t1 and t2, respectively.
• D is a parameterized distance measure.
• m is a predefined margin.
• T is a predefined time threshold.
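The SFA-style contrastive objective described above can be sketched as follows; the function name and the choice of Euclidean distance for D are my assumptions for illustration, not something the slides specify:

```python
import numpy as np

def sfa_contrastive_loss(x1, x2, t1, t2, m=1.0, T=1):
    """Slowness-style contrastive loss (a sketch of the SFA baseline).

    Temporally close frames (|t1 - t2| <= T) are pulled together;
    temporally distant frames are pushed at least margin m apart.
    """
    d = np.linalg.norm(x1 - x2)  # D: Euclidean distance between features
    if abs(t1 - t2) <= T:
        return d                  # similar pair: minimize the distance
    return max(0.0, m - d)        # dissimilar pair: hinge on the margin
```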
Training of this Network
• Datasets: MNIST, KITTI, SF
• The trained networks (KITTI-Net, SF-Net) are then used for further visual tasks
• Samples from the SF/KITTI datasets
On MNIST
• Translation:
• integer values in the range [-3, 3] along the X and Y axes
• binned into seven uniformly spaced bins
• Rotation:
• lies within the range [-30°, 30°] about the Z axis
• binned into bins of size 3° each, resulting in a total of 20 bins
• SFA: translations in the range [-1, 1], rotations within [-3°, 3°]
• 5 million image pairs
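Casting transformation regression as classification by binning, as in the MNIST setup above (integer shifts in [-3, 3] mapped to seven uniform bins per axis), can be sketched as below; `bin_value` is an illustrative helper, not from the paper:

```python
def bin_value(v, lo, hi, n_bins):
    """Map a transformation value v in [lo, hi] to one of n_bins
    uniformly spaced class labels (0-indexed)."""
    idx = int((v - lo) / (hi - lo) * n_bins)
    return min(idx, n_bins - 1)  # clamp the upper endpoint into the last bin

# e.g. the slides' translation setup: bin_value(t, -3, 3, 7) per axis,
# and rotation: bin_value(theta, -30, 30, 20)
```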
• 1. Camera direction as the Z axis
• 2. Image plane as the XY plane
• 3. Translations along the Z/X axes
• 4. Rotation about the Y axis (Euler angle)
• 5. Each individually binned into 20 uniformly spaced bins
• Training image pairs were formed from frames at most ±7 frames apart
• Constructed using Google StreetView (≈130K images)
• Camera transformations along all six dimensions of transformation
• Rotations between [-30°, 30°] were binned into 10 uniformly spaced bins, with two extra bins for rotations outside this range
• The three translations were individually binned into 10 uniformly spaced bins each
Evaluation on MNIST
1. The learned Base-CNN served as a pretraining method for a ConvNet classifying MNIST
2. Only a small amount of labeled data was used.
3. The learned feature representation increases the performance of the classifier.
Evaluation of KITTI- / SF-Net
• Measured in terms of performance on further visual tasks:
• 1. Scene classification
• 2. Large-scale image classification
• 3. Keypoint matching
• 4. Visual odometry
• estimating the camera transformation between image pairs
Scene Classification on SUN dataset
• 397 indoor/outdoor scene categories • provides 10 standard splits
of 5 and 20 training images per class and a standard test set of 50
images per class
• Compare KITTI/SF Net with: • 1. AlexNet pretrained on ImageNet •
2. GIST • 3. SPM
1. KITTI-Net outperforms SF-Net and is competitive with the baselines above.
2. Layer-4/5 features of KITTI-Net outperform layer-4/5 features of KITTI-SFA-Net.
Large Scale Image Classification
• All layers of KITTI-Net, KITTI-SFA-Net, and AlexNet-Scratch (i.e. a CNN with random weight initialization) were fine-tuned for image classification
• Comparison of AlexNet initialized with the pretrained KITTI-Net vs. AlexNet trained from scratch
Keypoint Matching (intra-class)
• PASCAL-VOC2012 dataset with Ground-truth object bounding boxes
• 1. Compute feature maps from layers 2-5
• 2. Compute a matching score for all pairs of GT bounding boxes in the same object class
• Features associated with keypoints in the first image were used to predict the locations of the same keypoints in the second image
• 3. Error measurement of matching:
• The normalized pixel distance between the actual and predicted keypoint locations
Details of Keypoint Matching
1. KITTI-Net-20K was superior to AlexNet-20K and AlexNet-100K, and inferior only to AlexNet-1M.
2. AlexNet-Rand surprisingly performed better than AlexNet-20K.
• All layers of KITTI-Net and AlexNet-1M were fine-tuned for 25K iterations using the training set of the SF dataset on the task of visual odometry.
Weakness, Limitation & Extension
Learning Dense Correspondence via 3D-guided Cycle Consistency
• 1. Motivation:
• Prior work on intra-class correspondence estimation via deep learning vs. work on computing correspondence across different instances
• Lack of data for dense correspondence
• 2. Naïve solution: train on rendered 3D models
• Utilize the concept of cycle consistency of correspondence flows:
• the composition of flow fields for any circular path through the
image set should have a zero combined flow.
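The zero-combined-flow property of a cycle can be illustrated with a toy flow composition; the HxWx2 offset layout, the nearest-neighbour lookup (the paper uses bilinear sampling), and the name `compose_flows` are my assumptions for this sketch:

```python
import numpy as np

def compose_flows(f_ab, f_bc):
    """Compose two dense flow fields a->b and b->c into a->c.

    Flows are HxWx2 arrays of (dx, dy) offsets. For each pixel p in a,
    follow f_ab to its match q in b, then add b's flow at q.
    """
    h, w, _ = f_ab.shape
    out = np.zeros_like(f_ab)
    for y in range(h):
        for x in range(w):
            dx, dy = f_ab[y, x]
            xq = int(np.clip(round(x + dx), 0, w - 1))  # nearest pixel in b
            yq = int(np.clip(round(y + dy), 0, h - 1))
            out[y, x] = f_ab[y, x] + f_bc[yq, xq]
    return out

# Cycle consistency: composing a flow with its inverse along a circular
# path should yield (approximately) zero net flow at every pixel.
```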
• “Meta-supervision”
• An end-to-end trained deep network for dense cross-instance correspondence that uses widely available 3D CAD models
Consistency in the Sense of Flow Field
• Predict a dense flow (or correspondence) field F(a,b) : R^2 → R^2
between pairs of images a and b.
• The flow field F(a,b)(p) = (px − qx, py − qy) computes the relative offset from each point p in image a to its corresponding point q in image b.
Consistency in the Sense of Matchability
• Why matchability?
• A matchability map M(a,b) : R^2 → [0, 1] predicts whether a correspondence exists (M(a,b)(p) = 1) or not (M(a,b)(p) = 0).
Consistency as Supervision
• While we do not know what the ground truth is, we know how it should behave.
• Specifically, for each pair of real training images r1 and r2, find a 3D CAD model of the same category and render two synthetic views s1 and s2 in similar viewpoints to r1 and r2.
• Each training quartet is <s1, s2, r1, r2>.
Aim to learn 2D image correspondences that potentially capture the underlying 3D structure.
Compared to Autoencoder
• Reconstruction vs. zero net flow
• Sparsity constraints vs. using the flow field from s1 to s2 as guidance
Loss function for Learning Dense Correspondence
Loss function for Learning Dense Matchability
Combined loss function is:
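The equation for the combined loss did not survive extraction from the slide; the following is a hedged reconstruction based on the cycle-consistency setup described above (the weighting λ and the exact terms in the original may differ):

```latex
\mathcal{L} \;=\; \mathcal{L}_{\text{flow}} \;+\; \lambda\,\mathcal{L}_{\text{matchability}}
```

where L_flow penalizes the discrepancy between the composed flow F(s1,r1) ∘ F(r1,r2) ∘ F(r2,s2) and the ground-truth flow F(s1,s2) rendered from the CAD model, and L_matchability is the analogous consistency term for the matchability maps.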
A Small Issue for Learning Matchability
• Matchability maps compose multiplicatively.
• Fix: set M(s1,r1) = 1 and M(r2,s2) = 1, and only train the CNN to infer M(r1,r2)
End-to-end Differentiable by Continuous Approximation
• Bilinear interpolation over the CNN predictions at discrete pixel locations
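Bilinear interpolation is what turns lookups at discrete pixel locations into a function that is continuous (and differentiable) in the predicted flow; a minimal single-channel sketch, with `bilinear_sample` as an illustrative name:

```python
import numpy as np

def bilinear_sample(grid, x, y):
    """Bilinearly interpolate a 2-D array at continuous coords (x, y).

    Weights the four surrounding pixels by their distance to (x, y),
    so the output varies smoothly as (x, y) moves between pixels.
    """
    h, w = grid.shape
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)  # clamp at the border
    wx, wy = x - x0, y - y0                          # fractional parts
    top = (1 - wx) * grid[y0, x0] + wx * grid[y0, x1]
    bot = (1 - wx) * grid[y1, x0] + wx * grid[y1, x1]
    return (1 - wy) * top + wy * bot
```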
• Data:
• The 3D CAD models used for constructing training quartets come from the ShapeNet database, while the real images are from the PASCAL3D+ dataset.
• 1. First initialize the network (partly) to mimic SIFT flow: •
minimize the Euclidean loss between the network prediction and the
SIFT flow output on the sampled pair.
• 2. Then fine-tune the whole network end-to-end to minimize the
combined consistency loss
Evaluation of Learning Performance
• 1. Feature visualization
• 2. Keypoint transfer
• 3. Matchability prediction
• 4. Shape-to-image segmentation transfer
• Extract conv-9 features from the entire set of car instances in the PASCAL3D+ dataset, and embed them in 2-D with the t-SNE algorithm.
• The result indicates that viewpoint is an important signal for similarity in the learned feature space.
• Compute the percentage of correct keypoint transfer (PCK) over
all image pairs as the metric for measuring the performance.
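A PCK computation can be sketched as follows; the threshold convention (a fraction alpha of a reference image size) is the usual one, but the exact reference length used in the paper is my assumption:

```python
import numpy as np

def pck(pred, gt, alpha=0.1, ref_size=100):
    """Percentage of Correct Keypoints (PCK).

    A transferred keypoint counts as correct if its predicted location
    lands within alpha * ref_size pixels of the ground-truth location.
    pred, gt: (N, 2) arrays of keypoint coordinates.
    """
    pred, gt = np.asarray(pred, float), np.asarray(gt, float)
    dists = np.linalg.norm(pred - gt, axis=1)
    return float(np.mean(dists <= alpha * ref_size))
```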
• Shape-to-image correspondence for transferring per-pixel labels (e.g. surface normals, segmentation masks) from shapes to real images
• 1. Construct a shape database of about 200 shapes per category,
with each shape being rendered in 8 canonical viewpoints.
• 2. Given a query real image, apply the network to predict the
correspondence between the query and each rendered view of the same
category, and warp the query image according to the predicted flow
• 3. Compare the HOG Euclidean distance between the warped query and the rendered views, and retrieve the rendered view with the minimum distance.
Limitation & Extension