Convolutional neural network architecture for geometric matching

Ignacio Rocco 1,2   Josef Sivic 1,2,3   Relja Arandjelović 1,2,*

1 DI ENS, École normale supérieure, PSL Research University
2 INRIA
3 CIIRC, CTU in Prague

Goal
∙ Instance-level and category-level image alignment
  ‣ Output: smooth dense correspondence field

Challenges
∙ Substantial appearance differences
∙ Presence of background clutter
∙ Lack of a large annotated dataset of real image pairs

Contributions
∙ CNN architecture suitable for category-level image alignment
∙ The model is trainable from synthetically warped image pairs
∙ Matching layer enables generalization to real image pairs

Overview
At training time:
∙ Inputs:
  - Synthetically warped image pair
  - Ground-truth transformation
∙ Output: Estimated parametric transformation
At evaluation time:
∙ Input: Real image pair
∙ Output: Estimated parametric transformation

Results on the Proposal Flow dataset
Methods                                          PCK (%)
DeepFlow [1]                                     20
GMK [2]                                          27
SIFT Flow [3]                                    38
DSP [4]                                          29
Proposal Flow [5]                                56
RANSAC with our features (affine)                47
Ours (affine)                                    49
Ours (affine + thin-plate spline)                56
Ours (affine ensemble + thin-plate spline)       57

Results on the Caltech-101 dataset
Methods                              LT-ACC   IoU    LOC-ERR
DeepFlow [1]                         0.74     0.40   0.34
GMK [2]                              0.77     0.42   0.34
SIFT Flow [3]                        0.75     0.48   0.32
DSP [4]                              0.77     0.47   0.35
Proposal Flow [5]                    0.78     0.50   0.25
Ours (affine)                        0.79     0.51   0.25
Ours (affine + thin-plate spline)    0.82     0.56   0.25

Model
[Figure: model diagram. Images I_A and I_B pass through feature extraction CNNs with shared parameters W, giving features f_A and f_B; the matching stage produces f_AB, which is passed to the regression CNN.]
[Figure: two-stage diagram. Feature extraction, matching and affine regression on I_A and I_B give the estimate θ̂_Aff; I_A is warped; feature extraction, matching and TPS regression then give the estimate θ̂_TPS.]
∙ Three-stage siamese CNN architecture mimicking the classical matching pipeline
  1. Feature extraction CNN: pre-trained VGG-16 model + per-column L2-normalization
  2. Matching: correlation layer + per-column L2-normalization
  3. Regression CNN: small CNN, trained from scratch
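The matching stage listed above (per-column L2-normalization of the features, a correlation layer, then per-column L2-normalization of the scores) can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `l2_normalize`, `correlation_layer` and `match`, the `(h, w, d)` array layout, the `eps` constant and the ordering of the last axis are all choices made for this sketch.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-6):
    """Scale x to unit L2 norm along `axis`; eps (an assumption) avoids division by zero."""
    norm = np.sqrt(np.sum(x * x, axis=axis, keepdims=True))
    return x / (norm + eps)


def correlation_layer(f_a, f_b):
    """Dot products between every feature column of f_b and every column of f_a.

    f_a, f_b: arrays of shape (h, w, d), one d-dimensional feature per position.
    Returns an array of shape (h, w, h*w): for each position (i, j) in f_b,
    the scores against all h*w positions of f_a (row-major order, a choice
    made for this sketch).
    """
    h, w, d = f_a.shape
    a = f_a.reshape(h * w, d)
    b = f_b.reshape(h * w, d)
    scores = b @ a.T  # shape (h*w, h*w): rows index f_b, columns index f_a
    return scores.reshape(h, w, h * w)


def match(f_a, f_b):
    """Normalize features, correlate, then normalize each column of scores."""
    f_a = l2_normalize(f_a, axis=-1)
    f_b = l2_normalize(f_b, axis=-1)
    scores = correlation_layer(f_a, f_b)
    return l2_normalize(scores, axis=-1)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Hypothetical sizes: a 15x15 feature map with 8 channels.
    f_a = rng.standard_normal((15, 15, 8))
    f_b = rng.standard_normal((15, 15, 8))
    print(match(f_a, f_b).shape)
```

With a 15×15 feature map the score tensor has 15·15 = 225 channels, which would be consistent with the 225 input channels listed for conv1 of the regression CNN, although the poster does not state the feature map size. Because each column of scores is normalized as a whole, a position with several similarly strong candidates ends up with lower individual scores than a position with one clear match.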
Stage details
∙ Feature extraction
  ‣ Output: w×h dense d-dimensional features
∙ Matching: correlation layer followed by L2 normalization
  ‣ Output: L2-normalized pairwise correlation tensor (pairwise matching scores)
  ‣ Insight: L2 normalization penalizes ambiguous matches
∙ Regression: conv1 (7×7×225×128), BN1, ReLU1, conv2 (5×5×128×64), BN2, ReLU2, FC (5×5×64×P)
  ‣ Output: Estimated parameters of geometric transformation
  ‣ Insight: The first-layer convolutional filters can specialize to detect local neighbourhood consensus
[Figure: averaged peaks from conv1 filters]

Coarse-to-fine matching architecture
∙ The same architecture can be applied with increasing geometric model complexity
  1. Coarse alignment using an affine transformation
  2. Refined alignment using a thin-plate spline transformation
∙ The final transformation is the composition of both stages
[Figure: source image, coarse alignment (affine), fine alignment (affine + TPS), target image]

Training from synthetic imagery
∙ Training pairs: formed from a real image and a synthetically warped image
∙ Insight: The loss computes a pixel distance and can be used with any type of differentiable geometric transformation
∙ Generalization: We show that the method is relatively unaffected by the nature of the training images
[Figure: Tokyo StreetView image and a synthetic training pair (I_A, I_B) featuring a thin-plate spline transformation; the diagram labels include two crop operations]
[Figure: overview diagrams. Training: a synthetically warped image pair and its ground-truth transformation are given to the CNN model. Evaluation: a real image pair (I_A, I_B) is given to the CNN model and I_A is warped.]

Evaluation on the Proposal Flow dataset
∙ Evaluated using annotated keypoints
∙ Metric: Percentage of correct keypoints (PCK)
[Figure: qualitative results; columns: source image, aligned result, target image]

Evaluation on the Caltech-101 dataset
∙ Evaluated using annotated object segmentation masks
∙ Metrics: Label transfer accuracy (LT-ACC), Intersection over union (IoU), Localization error (LOC-ERR)
[Figure: qualitative comparison to other methods; columns: image pair, DeepFlow [1], GMK [2], SIFT Flow [3], DSP [4], Proposal Flow [5], our method]

* Now at DeepMind

References
[1] P. Weinzaepfel, et al. DeepFlow: Large displacement optical flow with deep matching. In Proc. ICCV, 2013.
[2] O. Duchenne, et al. A graph-matching kernel for object categorization. In Proc. ICCV, 2011.
[3] C. Liu, et al. SIFT Flow: Dense correspondence across scenes and its applications. IEEE PAMI, 2011.
[4] J. Kim, et al. Deformable spatial pyramid matching for fast dense correspondences. In Proc. CVPR, 2013.
[5] B. Ham, M. Cho, C. Schmid, and J. Ponce. Proposal Flow. In Proc. CVPR, 2016.
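
Supplementary sketch. The training insight stated earlier (the loss computes a pixel distance and can be used with any type of differentiable geometric transformation) can be illustrated with a small NumPy example. The poster does not give the formula, so everything here is an assumption made for illustration: the loss is taken to be the mean squared distance between a fixed grid of points mapped by the estimated and by the ground-truth transformation, coordinates are normalized to [-1, 1] rather than measured in pixels, and the names `make_grid`, `affine_transform` and `grid_distance_loss` are hypothetical.

```python
import numpy as np


def make_grid(n=20):
    """n x n grid of points in normalized coordinates [-1, 1]^2, shape (n*n, 2)."""
    lin = np.linspace(-1.0, 1.0, n)
    xx, yy = np.meshgrid(lin, lin)
    return np.stack([xx.ravel(), yy.ravel()], axis=1)


def affine_transform(theta, points):
    """Apply a 6-parameter affine map, theta = [a11, a12, tx, a21, a22, ty]."""
    m = np.asarray(theta, dtype=float).reshape(2, 3)
    return points @ m[:, :2].T + m[:, 2]


def grid_distance_loss(theta_est, theta_gt, transform=affine_transform, grid=None):
    """Mean squared distance between grid points mapped by the two transformations.

    `transform` can be any function (theta, points) -> points, so the same
    loss applies to other parametric models, such as a thin-plate spline.
    """
    if grid is None:
        grid = make_grid()
    diff = transform(theta_est, grid) - transform(theta_gt, grid)
    return float(np.mean(np.sum(diff ** 2, axis=1)))


if __name__ == "__main__":
    identity = [1.0, 0.0, 0.0, 0.0, 1.0, 0.0]
    shifted = [1.0, 0.0, 0.5, 0.0, 1.0, 0.0]  # translation of 0.5 along x
    print(grid_distance_loss(identity, identity))
    print(grid_distance_loss(shifted, identity))
```

The loss only needs the positions of the transformed points, not the parameters themselves, which is why swapping `affine_transform` for another point-mapping function leaves the rest unchanged. NumPy does not compute gradients; in an automatic-differentiation framework the same expression would be differentiable with respect to the estimated parameters.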