Recurrent Transformer Networks for Semantic Correspondence
Seungryong Kim, Stephen Lin, Sangryul Jeon, Dongbo Min, and Kwanghoon Sohn
Neural Information Processing Systems (NeurIPS) 2018

Semantic Correspondence
• Establishing dense correspondences between semantically similar images (different instances within the same object class)

Outline: Introduction, Background, Recurrent Transformer Networks, Experimental Results and Discussion

Challenges in Semantic Correspondence
• Photometric/geometric deformations, lack of supervision

Problem Formulation
• Given a pair of images $I^s$ and $I^t$, infer a field of affine transformations $T_i$ for each pixel $i$ that maps pixel $i$ to $i' = T_i(i)$

Intuition of RTNs

Network Configuration

Feature Extraction Networks
• To extract features $D^s$ and $D^t$, input images $I^s$ and $I^t$ are passed through convolutional networks with parameters $W_F$ such that $D = F(I \,|\, W_F)$, using CAT-FCSS, VGGNet (conv4-4), or ResNet (conv4-23)

Recurrent Geometric Matching Networks
• Constrained correlation volume
  $C(D^s_i, D^t(T_i)) = \langle D^s_i, D^t(T_i) \rangle \,/\, \|\langle D^s_i, D^t(T_i) \rangle\|_2$
• Recurrent geometry estimation
  $T^k_i - T^{k-1}_i = F(C(D^s, D^t(T^{k-1})) \,|\, W_G)$

Weakly-supervised Learning
• Intuition: the matching score between the source feature at each pixel $i$ and the target feature $D^t(T_i)$ should be maximized while keeping the scores of other transformation candidates low:
  $L = -\sum_{l \in M_i} l^{*} \log(p(D^s_i, D^t(T_l)))$
  where $p(D^s_i, D^t(T_l))$ is a softmax probability
  $p(D^s_i, D^t(T_l)) = \exp(C(D^s_i, D^t(T_l))) \,/\, \sum_{l' \in M_i} \exp(C(D^s_i, D^t(T_{l'})))$
  and $l^{*}$ denotes a class label defined as 1 if $l = i$, 0 otherwise

Ablation Study
• RTNs converge in 3–5 iterations
• Accuracy improves as the local window grows up to 9×9, but larger window sizes reduce accuracy

Results on TSS Benchmark

Results on PF-WILLOW/PF-PASCAL Benchmarks

Background
• Methods for geometric invariance in the regularization step: geometric matching methods [Rocco'17,'18] — inference uses both source and target images; $T$ is learned with $T^{*}$ using self- or meta-supervision
• Methods for geometric invariance in the feature extraction step: STN-based methods [Choy'16, Kim'18] — inference is based on only the source or target image; $T$ is learned with or without $T^{*}$, depending on the method
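The constrained correlation, the recurrent refinement, and the weakly-supervised softmax loss can be sketched per pixel as follows. This is a minimal NumPy illustration under simplifying assumptions, not the paper's implementation: it treats a single source descriptor against $L$ candidate target descriptors (one per transformation candidate in the window $M_i$), and the geometry network $F(\cdot\,|\,W_G)$ is stubbed out by an arbitrary `residual_fn`; the names `correlation`, `matching_loss`, and `recurrent_refinement` are illustrative.

```python
import numpy as np

def correlation(d_s, D_t):
    """Inner products between one source descriptor (C,) and L candidate
    target descriptors (L, C), L2-normalized over the candidates."""
    scores = D_t @ d_s
    return scores / (np.linalg.norm(scores) + 1e-8)

def softmax(x):
    e = np.exp(x - x.max())  # shift for numerical stability
    return e / e.sum()

def matching_loss(d_s, D_t, true_idx):
    """Cross-entropy over transformation candidates: the candidate with
    l = i (identity mapping of pixel i) is the positive class."""
    p = softmax(correlation(d_s, D_t))
    return -np.log(p[true_idx] + 1e-12)

def recurrent_refinement(T, residual_fn, n_iters=4):
    """T^k = T^{k-1} + F(C(D^s, D^t(T^{k-1})) | W_G), with the geometry
    network replaced by a caller-supplied stub."""
    for _ in range(n_iters):
        T = T + residual_fn(T)
    return T

# Toy example: 3 candidates for a 4-D descriptor; candidate 1 is the match.
d_s = np.ones(4)
D_t = np.array([[ 1.,  0.,  0.,  0.],   # weak candidate
                [ 1.,  1.,  1.,  1.],   # correct candidate (l = i)
                [-1., -1., -1., -1.]])  # opposite candidate
print(matching_loss(d_s, D_t, true_idx=1))  # small loss: match scores highest
```

As in the ablation study, a handful of refinement iterations suffices; e.g. `recurrent_refinement(T0, residual_fn, n_iters=4)` applies four additive updates to an initial transformation field.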
Recurrent Transformer Networks (RTNs)
• Weave the advantages of both existing STN-based methods and geometric matching methods!

[Qualitative results: source/target pairs compared across DCTM, SCNet, CAT-FCSS, Gmat. w/Inl, and RTNs]

Discussion
• ResNet features exhibit the best performance; fine-tuned features show improved accuracy
• Learning the feature extraction networks and geometric matching networks jointly can boost accuracy
• RTNs show state-of-the-art performance

Project webpage: http://diml.yonsei.ac.kr/~srkim/RTNs