LEARNING FINE-GRAINED IMAGE SIMILARITY WITH DEEP RANKING

Jiang Wang (Northwestern), Yang Song (Google), Thomas Leung (Google), Chuck Rosenberg (Google), Jingbin Wang (Google), James Philbin (Google), Bo Chen (Caltech), Ying Wu (Northwestern)

PROBLEM

Fine-grained image similarity: ranking images within the same category, as required by image-search applications. Similarity is defined by triplets (query, positive, negative).

[Figure: an example triplet — query, positive, and negative image.]

• Image similarities are defined by subtle differences.
• Triplet training data is more difficult to obtain than category labels.
• We would like to train a model directly from images instead of relying on hand-crafted features.

ARCHITECTURE

[Figure: a triplet sampling layer draws triplets (p_i, p_i^+, p_i^-) from the image set; three copies of the embedding network f map them to f(p_i), f(p_i^+), f(p_i^-), which feed a ranking layer.]

Contributions:
• A novel deep ranking network that learns a fine-grained image similarity model directly from images.
• A multi-scale network structure.
• A computationally efficient online triplet sampling algorithm.
• A high-quality triplet evaluation dataset.

RELATED WORK

• Category-level image similarity: similarities are defined purely by class labels.
• Classification deep learning models.
• Pairwise ranking models.

FORMULATION

The similarity of two images P and Q is defined by their squared Euclidean distance in the image embedding space:

    D(f(P), f(Q)) = ||f(P) - f(Q)||_2^2    (1)

Triplet-based objective: let r_{i,j} = r(p_i, p_j) be the pairwise relevance score. We require

    D(f(p_i), f(p_i^+)) < D(f(p_i), f(p_i^-))
    for all p_i, p_i^+, p_i^- such that r(p_i, p_i^+) > r(p_i, p_i^-),    (2)

where t_i = (p_i, p_i^+, p_i^-) is a triplet.
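The distance of Eq. (1) and the ranking constraint of Eq. (2) can be sketched in a few lines; this is a minimal illustration assuming NumPy embedding vectors, and the function names are hypothetical, not from the paper:

```python
import numpy as np

def embed_distance(f_p, f_q):
    # Eq. (1): squared Euclidean distance between two embedding vectors.
    return float(np.sum((f_p - f_q) ** 2))

def satisfies_triplet(f, p, p_pos, p_neg):
    # Eq. (2): the positive image must lie closer to the query
    # in embedding space than the negative image does.
    return embed_distance(f(p), f(p_pos)) < embed_distance(f(p), f(p_neg))
```

Here f is any embedding function mapping an image to a vector; training pushes f toward satisfying this constraint on all sampled triplets.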
The hinge loss for a triplet is

    l(p_i, p_i^+, p_i^-) = max{0, g + D(f(p_i), f(p_i^+)) - D(f(p_i), f(p_i^-))},    (3)

where g is the gap parameter enforcing a margin between the two distances.

MULTI-SCALE ARCHITECTURE

[Figure: multi-scale network. The 225x225 input feeds a deep ConvNet branch plus two shallow branches operating on 4:1 (57x57) and 8:1 (29x29) subsampled inputs, each consisting of a single convolution followed by max pooling. The branch outputs are l2-normalized and combined by a linear embedding into a 4096-dimensional vector, which is l2-normalized again.]

TRAINING DATA

• ImageNet for pre-training: category-level information.
• Relevance training data: fine-grained visual information.
  - Golden Feature: good for visual similarity but not for semantic similarity, and expensive to compute.

OPTIMIZATION

• Asynchronous stochastic gradient algorithm.
• Momentum algorithm.
• Dropout to avoid overfitting.

Challenges:
• Cannot enumerate all triplets; important triplets must be sampled.
• Cannot load all images into memory; triplets must be generated online.

TRIPLET SAMPLING

Sampling criterion: more highly relevant images are sampled more often. The total relevance score r_i is

    r_i = sum_{j: c_j = c_i, j != i} r_{i,j}    (4)

• Query image: sampled according to its total relevance score.
• Positive image: sampled from images with the same label as the query, with probability P(p_i^+) = min{T_p, r_{i,i^+}} / Z_i.
• Negative image, two types of samples:
  1. In-class negatives: drawn with the same distribution as positive images, with the additional requirement that the margin between the relevance scores r_{i,i^+} and r_{i,i^-} be larger than T_r.
  2. Out-of-class negatives: drawn uniformly from all images in different categories.

Online triplet sampling is implemented with reservoir sampling.

[Figure: online triplet sampling — each incoming image sample is routed to the buffer of its query; triplets (query, positive, negative) are then drawn from these per-query buffers.]

EXPERIMENTS

Comparison with hand-crafted features:

Method              Precision   Score-30
Wavelet             62.2%       2735
Color               62.3%       2935
SIFT-like           65.5%       2863
Fisher              67.2%       3064
HOG                 68.4%       3099
SPMKtexton1024max   66.5%       3556
L1HashKPCA          76.2%       6356
OASIS               79.2%       6813
Golden Features     80.3%       7165
DeepRanking         85.7%       7004

Comparison of different architectures:

Method                          Precision   Score-30
ConvNet                         82.8%       5772
Single-scale Ranking            84.6%       6245
OASIS on Single-scale Ranking   82.5%       6263
Single-Scale & Visual Feature   84.1%       6765
DeepRanking                     85.7%       7004

Comparison of different sampling methods:

[Figure: Score-30 (left) and overall precision (right) as a function of the fraction of out-of-class negative samples, comparing weighted and uniform sampling.]

RANKING EXAMPLES

[Figure: example queries with ranking results from ConvNet, OASIS, and Deep Ranking.]

ACKNOWLEDGMENT

This work was done while the first author was an intern at Google.

DATA

High-quality image triplet evaluation dataset, available at https://sites.google.com/site/imagesimilaritydata/
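The weighted positive-sampling rule from the TRIPLET SAMPLING section, P(p_i^+) = min{T_p, r_{i,i^+}} / Z_i, can be sketched as follows. This is an illustrative Python sketch under stated assumptions: the helper names are hypothetical, and the cap T_p and relevance scores are supplied by the caller:

```python
import random

def positive_sampling_weights(rel_scores, t_p):
    # Weight for each in-class candidate: min(T_p, r_{i,i+}), normalized by Z_i.
    # The cap T_p keeps a single highly relevant image from dominating the draw.
    capped = [min(t_p, r) for r in rel_scores]
    z = sum(capped)  # Z_i, the normalization constant
    return [w / z for w in capped]

def sample_positive(candidates, rel_scores, t_p, rng=random):
    # Draw one positive image with probability min(T_p, r) / Z_i.
    weights = positive_sampling_weights(rel_scores, t_p)
    return rng.choices(candidates, weights=weights, k=1)[0]
```

In-class negatives would reuse the same distribution, with the extra filter that the relevance margin r_{i,i^+} - r_{i,i^-} exceeds T_r; out-of-class negatives are a uniform draw over other categories.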