Weakly Supervised Correspondence Estimation
Zhiwei Jia


Dec 18, 2021



CSE291
Weakly Supervised Learning
• 1. Not enough labeled data
• 2. Transfer learning
• 3. Help to increase performance of supervised learning
• 4. To provide good insights on solving certain learning problems
Learning to See by Moving
• 1. Biological background
• 2. Why use egomotion information as supervision?
• Availability of “labeled data”
• 3. Overview:
• Egomotion information as a form of self-supervision
• 4. Main result:
• Learned visual representation compared favourably to that learnt using direct supervision on the tasks of scene recognition, object recognition, visual odometry and keypoint matching
Main Approach
• 1. Correlating visual stimuli with egomotion:
• Egomotion ⇔ camera motion
• Predict the camera transformation from the resulting pairs of images.
• 2. Visual correspondence can help visual tasks in general:
• Pretraining for other tasks
Architecture Overview
3. The TCNN (top-level CNN) is used only during training
Compared to SFA Training
• x_t1, x_t2 refer to the feature representations of frames observed at times t1 and t2, respectively.
• D is a measure of distance (possibly parameterized) between the two representations.
• m is a predefined margin.
• T is a predefined time threshold.
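The symbols above describe a contrastive slow-feature-analysis (SFA) loss: frames within the time threshold T are pulled together, while frames farther apart in time are pushed at least a margin m apart. A minimal sketch of that idea follows, assuming D is the L2 distance. The function name `sfa_loss` and the default values of `m` and `T` are illustrative and not taken from the slides.

```python
import numpy as np

def sfa_loss(x_t1, x_t2, t1, t2, m=1.0, T=1):
    """Contrastive SFA loss for one pair of frame features.

    Assumptions: D is the L2 (Euclidean) distance, and the defaults for
    m and T are placeholders rather than values from the slides.
    """
    a = np.asarray(x_t1, dtype=float)
    b = np.asarray(x_t2, dtype=float)
    d = float(np.linalg.norm(a - b))
    if abs(t1 - t2) <= T:
        # Temporally close frames: penalize any distance between features.
        return d
    # Temporally distant frames: penalize only if closer than the margin.
    return max(0.0, m - d)
```

Close pairs contribute their full distance to the loss; distant pairs contribute nothing once they are at least `m` apart.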
Training of this Network
• MNIST
• KITTI
• SF dataset
• 3. Trained networks used for further visual tasks
• KITTI-Net
• SF-Net
Samples from SF/KITTI Dataset
On MNIST
• Translation:
• integer value in the range [-3, 3]
• along the X and Y axes
• binned into seven uniformly spaced bins
• Rotation:
• lies within the range [-30°, 30°]
• about the Z axis
• binned into bins of 3° each, resulting in a total of 20 bins
• SFA:
• translation in the range [-1, 1] and rotation within [-3°, 3°] (pairs treated as temporally close)
• 5 million image pairs
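Posing transformation prediction as classification means each translation or rotation must be turned into a bin index. The sketch below shows one way to do this for the MNIST ranges listed above. The function names are hypothetical, and folding the right edge (+30°) into the last bin is an assumption of this sketch.

```python
def translation_bin(t):
    """Map an integer translation in [-3, 3] to one of 7 class labels (0..6)."""
    if not -3 <= t <= 3:
        raise ValueError("translation outside [-3, 3]")
    return int(t) + 3

def rotation_bin(deg, lo=-30.0, width=3.0, n_bins=20):
    """Map a rotation in [-30, 30] degrees to one of 20 bins of 3 degrees each."""
    if not lo <= deg <= lo + width * n_bins:
        raise ValueError("rotation outside [-30, 30] degrees")
    # Assumption: the right edge (+30) is folded into the last bin.
    return min(int((deg - lo) // width), n_bins - 1)
```

Each image pair then carries three labels (X translation, Y translation, Z rotation), one per classifier.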
On KITTI
• 1. Camera direction as the Z axis
• 2. Image plane as the XY plane
• 3. Translations along the Z/X axes
• 4. Rotation about the Y axis (Euler angle)
• 5. Each individually binned into 20 uniformly spaced bins
• Training image pairs were taken from frames at most ±7 frames apart
SF Dataset
• Constructed using Google StreetView (≈ 130K images).
• Camera transformation along all six dimensions of transformation.
• Rotations between [-30°, 30°] were binned into 10 uniformly spaced bins, and two extra bins were used for rotations larger and smaller than this range.
• Three translations were individually binned into 10 uniformly spaced bins each.
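The SF rotation labels differ from the MNIST ones in that out-of-range values get their own bins. A minimal sketch using `np.digitize` follows. The ordering of the 12 labels (underflow first, overflow last) and the function name `sf_rotation_bin` are assumptions, not details from the slides.

```python
import numpy as np

# 11 edges define 10 uniform bins of 6 degrees each over [-30, 30].
_EDGES = np.linspace(-30.0, 30.0, 11)

def sf_rotation_bin(deg):
    """Return a label in 0..11 for a rotation given in degrees.

    0      -> rotation smaller than -30
    1..10  -> the ten uniform bins inside [-30, 30)
    11     -> rotation of 30 or larger
    (Label ordering is an assumption of this sketch.)
    """
    return int(np.digitize(deg, _EDGES))
```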
Evaluation on MNIST
1. The learned Base-CNN serves as pretraining for a ConvNet that classifies MNIST digits.
2. Classification is trained with only a small amount of labeled data.
3. The learned feature representation improves classification performance.
Evaluation of KITTI- / SF-Net
• Measured by performance on further visual tasks:
• 1. Scene classification
• 2. Large-scale image classification
• 3. Keypoint matching
• 4. Visual odometry: estimating the camera transformation between image pairs.
Scene Classification on SUN dataset
• 397 indoor/outdoor scene categories • provides 10 standard splits of 5 and 20 training images per class and a standard test set of 50 images per class
• Compare KITTI/SF Net with: • 1. AlexNet pretrained on ImageNet • 2. GIST • 3. SPM
1. KITTI-Net outperforms SF-Net and is comparable to AlexNet-20K.
2. Layer 4 and 5 features of KITTI-Net outperform layer 4 and 5 features of KITTI-SFA-Net.
Large Scale Image Classification
• All layers of KITTI-Net, KITTI-SFA-Net and AlexNet-Scratch (i.e. CNN with random weight initialization) were finetuned for image classification.
• Comparison: AlexNet initialized from the pretrained KITTI-Net vs. AlexNet trained from scratch
Keypoint Matching (intra-class)
• PASCAL-VOC2012 dataset with ground-truth object bounding boxes (GT-BBOX)
• 1. Compute feature maps from layers 2-5
• 2. Compute matching scores for all pairs of GT-BBOX in the same object class:
• the features associated with keypoints in the first image are used to predict the location of the same keypoints in the second image.
• 3. Error measurement of matching:
• the normalized pixel distance between the actual and predicted keypoint locations
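The steps above can be sketched as a nearest-neighbour lookup in feature space followed by a normalized distance. This is only an illustration: cosine similarity stands in for the matching procedure, which the slides do not spell out, and the normalizing length `norm_len` (for example a bounding-box diagonal) is likewise an assumption.

```python
import numpy as np

def predict_keypoint(feat1, feat2, kp):
    """Return the (row, col) in feat2 whose feature best matches feat1 at kp.

    feat1, feat2: feature maps of shape (H, W, C); kp: (row, col) in feat1.
    Assumption: cosine similarity with a nearest-neighbour search.
    """
    query = feat1[kp[0], kp[1]]
    flat = feat2.reshape(-1, feat2.shape[-1])
    eps = 1e-12  # avoids division by zero for all-zero features
    sims = flat @ query / (np.linalg.norm(flat, axis=1) * np.linalg.norm(query) + eps)
    idx = int(np.argmax(sims))
    return divmod(idx, feat2.shape[1])

def normalized_error(pred, actual, norm_len):
    """Pixel distance between predicted and actual keypoints, divided by norm_len.

    Assumption: norm_len is a reference length chosen by the caller.
    """
    return float(np.hypot(pred[0] - actual[0], pred[1] - actual[1]) / norm_len)
```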
Details of Keypoint Matching
Comparison Result
1. KITTI-Net-20K was superior to AlexNet-20K and AlexNet-100K and inferior only to AlexNet-1M.
2. AlexNet-Rand surprisingly performed better than AlexNet-20K.
Visual Odometry
• All layers of KITTI-Net and AlexNet-1M were finetuned for 25K iterations using the training set of SF dataset on the task of visual odometry.
Weakness, Limitation & Extension

Learning Dense Correspondence via 3D-guided Cycle Consistency
• 1. Background:
• Deep-learning work on intra-class correspondence estimation vs. work on computing correspondence across different object/scene instances.
• Lack of data for dense correspondence
• 2. Naïve solution: train on rendered 3D models
Main Approach
• Utilize the concept of cycle consistency of correspondence flows: • the composition of flow fields for any circular path through the image set should have a zero combined flow.
• “meta-supervision” • End-to-end trained deep network for dense cross-instance correspondence that uses the widely available 3D CAD models.
Consistency in the Sense of Flow Field
• Predict a dense flow (or correspondence) field F(a,b) : R^2 → R^2 between pairs of images a and b.
• The flow field F(a,b)(p) = (qx−px, qy−py) gives the relative offset from each point p in image a to its corresponding point q in image b.
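To make the cycle-consistency idea concrete, the sketch below composes two flow fields and checks that a closed path yields zero combined flow. It assumes the flow stores the offset that carries p to its match q, so that F(a,c)(p) = F(a,b)(p) + F(b,c)(p + F(a,b)(p)). The nearest-neighbour lookup and border clamping are simplifications chosen for this sketch, and `compose_flows` is a hypothetical name.

```python
import numpy as np

def compose_flows(f_ab, f_bc):
    """Compose two flow fields of shape (H, W, 2); last axis is (dx, dy).

    F_ac(p) = F_ab(p) + F_bc(p + F_ab(p)). The lookup into f_bc rounds to
    the nearest pixel and clamps to the image border (sketch assumption).
    """
    h, w, _ = f_ab.shape
    ys, xs = np.mgrid[0:h, 0:w]
    qx = np.clip(np.rint(xs + f_ab[..., 0]).astype(int), 0, w - 1)
    qy = np.clip(np.rint(ys + f_ab[..., 1]).astype(int), 0, h - 1)
    return f_ab + f_bc[qy, qx]

# A constant shift of (+2, +1) pixels and its inverse form a two-image cycle.
f_ab = np.tile(np.array([2.0, 1.0]), (5, 6, 1))
f_ba = -f_ab
cycle = compose_flows(f_ab, f_ba)
cycle_error = float(np.abs(cycle).max())  # zero for a consistent cycle
```

Longer cycles follow by chaining `compose_flows`, for example over the three hops of a path through four images.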
Consistency in the Sense of Matchability
• Why matchability?
• A matchability map M(a,b) : R^2 → [0, 1] predicts whether a correspondence exists (M(a,b)(p) = 1) or not (M(a,b)(p) = 0).
Consistency as Supervision
• While we do not know what the ground-truth is, we know how it should behave.
• Specifically, for each pair of real training images r1 and r2, find a 3D CAD model of the same category and render two synthetic views s1 and s2 in viewpoints similar to those of r1 and r2, respectively.
• Each training sample is a quartet < s1, s2, r1, r2 >
Aim: learn 2D image correspondences that potentially capture the 3D semantics.
Compared to Autoencoder
• Reconstruction vs. zero net flow
• Sparsity constraints vs. using the constructed flow field from s1 to s2 as guidance
Loss function for Learning Dense Correspondence
Loss function for Learning Dense Matchability
Combined loss function is:
A Small Issue for Learning Matchability
• Multiplicative composition. • Could fix M(s1,r1) = 1 and M(r2,s2) = 1, and only train the CNN to infer M(r1,r2)
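One reading of multiplicative composition is that a point is matchable along a chain only if every hop is matchable: M(a,c)(p) = M(a,b)(p) · M(b,c)(p + F(a,b)(p)). The sketch below illustrates why fixing the two synthetic-to-real maps to 1 leaves M(r1,r2) as the only factor that shapes the result. The function name, the nearest-neighbour lookup, and the zero flows in the example are assumptions made for brevity.

```python
import numpy as np

def compose_matchability(m_ab, m_bc, f_ab):
    """M_ac(p) = M_ab(p) * M_bc(p + F_ab(p)), with nearest-neighbour lookup.

    m_ab, m_bc: maps of shape (H, W) with values in [0, 1].
    f_ab: flow of shape (H, W, 2), last axis (dx, dy).
    """
    h, w = m_ab.shape
    ys, xs = np.mgrid[0:h, 0:w]
    qx = np.clip(np.rint(xs + f_ab[..., 0]).astype(int), 0, w - 1)
    qy = np.clip(np.rint(ys + f_ab[..., 1]).astype(int), 0, h - 1)
    return m_ab * m_bc[qy, qx]

# Toy quartet: both fixed maps are all ones, flows are zero for simplicity.
ones = np.ones((3, 4))
zero_flow = np.zeros((3, 4, 2))
m_r1r2 = np.linspace(0.0, 1.0, 12).reshape(3, 4)
m_s1r2 = compose_matchability(ones, m_r1r2, zero_flow)   # s1 -> r1 -> r2
m_s1s2 = compose_matchability(m_s1r2, ones, zero_flow)   # ... -> s2
```

With the outer maps fixed, the composed map equals the predicted M(r1,r2) (up to warping), so the network cannot satisfy the loss by shrinking the other two factors.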
End-to-end Differentiable by Continuous Approximation
• Bilinear interpolation over the CNN predictions at discrete pixel locations.
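Composing flows requires reading a prediction at a non-integer location p + F(p). Bilinear interpolation does this with a weighted average of the four surrounding grid values, which varies continuously with the sampling location. A minimal NumPy sketch follows; clamping at the image border and the name `bilinear_sample` are assumptions of this sketch.

```python
import numpy as np

def bilinear_sample(field, x, y):
    """Sample a (H, W, C) field at a continuous location (x, y).

    Returns a weighted average of the four neighbouring grid values.
    Coordinates are clamped to the image border (sketch assumption).
    """
    h, w = field.shape[:2]
    x = float(np.clip(x, 0, w - 1))
    y = float(np.clip(y, 0, h - 1))
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    ax, ay = x - x0, y - y0  # fractional offsets inside the grid cell
    top = (1 - ax) * field[y0, x0] + ax * field[y0, x1]
    bottom = (1 - ax) * field[y1, x0] + ax * field[y1, x1]
    return (1 - ay) * top + ay * bottom
```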

Overall Architecture
Training Process
• Data: • The 3D CAD models used for constructing training quartets come from the ShapeNet database, while the real images are from the PASCAL3D+ dataset.
• 1. First initialize the network (partly) to mimic SIFT flow: • minimize the Euclidean loss between the network prediction and the SIFT flow output on the sampled pair.
• 2. Then fine-tune the whole network end-to-end to minimize the combined consistency loss
Evaluation of Learning Performance
• 1. Feature visualization
• 2. Keypoint transfer
• 3. Matchability prediction
• 4. Shape-to-image segmentation transfer
Feature Visualization
• Extract conv-9 features from the entire set of car instances in the PASCAL3D+ dataset, and embed them in 2-D with the t-SNE algorithm.
• The result indicates that viewpoint is an important signal for similarity in the learned network.
Keypoint Transfer
• Compute the percentage of correct keypoint transfer (PCK) over all image pairs as the metric for measuring the performance.
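A sketch of the PCK metric follows. It assumes a transferred keypoint counts as correct when it lies within alpha · max(height, width) pixels of the ground truth. That threshold rule and the default `alpha=0.1` are assumptions of this sketch, since the slides do not state them.

```python
import numpy as np

def pck(pred, gt, height, width, alpha=0.1):
    """Fraction of keypoints transferred to within the distance threshold.

    pred, gt: arrays of shape (N, 2) holding predicted and true locations.
    Assumption: threshold = alpha * max(height, width).
    """
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    dist = np.linalg.norm(pred - gt, axis=1)
    return float(np.mean(dist <= alpha * max(height, width)))
```

Multiply the returned fraction by 100 to report it as a percentage.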

Shape-to-image Segmentation Transfer
• Shape-to-image correspondence for transferring per-pixel labels (e.g. surface normals, segmentation masks, etc.) from shapes to real images.
• 1. Construct a shape database of about 200 shapes per category, with each shape being rendered in 8 canonical viewpoints.
• 2. Given a query real image, apply the network to predict the correspondence between the query and each rendered view of the same category, and warp the query image according to the predicted flow field.
• 3. Compare the HOG Euclidean distance between the warped query and the rendered views, and retrieve the rendered view with minimum distance.
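Step 3 reduces to an arg-min over Euclidean distances between descriptor pairs. The sketch below assumes the HOG descriptors have already been computed, one for each warped query and one for each rendered view, and stacked row by row. The function name and array layout are hypothetical.

```python
import numpy as np

def retrieve_best_view(warped_query_descs, view_descs):
    """Return (index, distance) of the rendered view closest to its warped query.

    warped_query_descs[i]: descriptor of the query warped toward view i.
    view_descs[i]: descriptor of rendered view i. Both have shape (V, D).
    """
    warped = np.asarray(warped_query_descs, dtype=float)
    views = np.asarray(view_descs, dtype=float)
    dists = np.linalg.norm(warped - views, axis=1)
    best = int(np.argmin(dists))
    return best, float(dists[best])
```

The per-pixel labels of the retrieved view can then be carried back to the query image along the predicted flow.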
Limitation & Extension