LEARNING FINE-GRAINED IMAGE SIMILARITY WITH DEEP RANKING

Jiang Wang (Northwestern), Yang Song (Google), Thomas Leung (Google), Chuck Rosenberg (Google), Jingbin Wang (Google), James Philbin (Google), Bo Chen (Caltech), Ying Wu (Northwestern)

PROBLEM

Fine-grained image similarity: ranking images within the same category, as required by image-search applications. Similarity is defined by triplets (query, positive, negative).

[Figure: an example triplet — query, positive, and negative image.]

• Image similarities are defined by subtle differences.
• Triplet training data is more difficult to obtain than category labels.
• We would like to train a model directly from images instead of relying on hand-crafted features.

ARCHITECTURE

[Figure: a triplet sampling layer draws triplets (p_i, p_i^+, p_i^-) from the image set; three copies of the embedding network f map them to f(p_i), f(p_i^+), f(p_i^-), which feed a ranking layer.]

Contributions:
• A novel deep ranking network that learns a fine-grained image similarity model directly from images.
• A multi-scale network structure.
• A computationally efficient online triplet sampling algorithm.
• A high-quality triplet evaluation dataset.

RELATED WORK

• Category-level image similarity: similarities are defined purely by class labels.
• Classification deep learning models.
• Pairwise ranking models.

FORMULATION

The similarity of two images P and Q is defined by their squared Euclidean distance in the image embedding space:

    D(f(P), f(Q)) = ||f(P) - f(Q)||_2^2    (1)

Triplet-based objective: let r_{i,j} = r(p_i, p_j) be the pairwise relevance score. We require

    D(f(p_i), f(p_i^+)) < D(f(p_i), f(p_i^-))
    for all p_i, p_i^+, p_i^- such that r(p_i, p_i^+) > r(p_i, p_i^-),    (2)

where t_i = (p_i, p_i^+, p_i^-) is a triplet.
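The distance of Eq. (1) and the ranking constraint of Eq. (2) can be sketched in a few lines; this is a minimal illustration assuming NumPy embedding vectors, and the function names are hypothetical, not from the paper:

```python
import numpy as np

def embed_distance(f_p, f_q):
    # Eq. (1): squared Euclidean distance between two embedding vectors.
    return float(np.sum((f_p - f_q) ** 2))

def satisfies_triplet(f, p, p_pos, p_neg):
    # Eq. (2): the positive image must lie closer to the query
    # in embedding space than the negative image does.
    return embed_distance(f(p), f(p_pos)) < embed_distance(f(p), f(p_neg))
```

Here f is any embedding function mapping an image to a vector; training pushes f toward satisfying this constraint on all sampled triplets.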
The hinge loss for a triplet is

    l(p_i, p_i^+, p_i^-) = max{0, g + D(f(p_i), f(p_i^+)) - D(f(p_i), f(p_i^-))},    (3)

where g is the gap parameter enforcing a margin between the two distances.

MULTI-SCALE ARCHITECTURE

[Figure: multi-scale network. The 225x225 input feeds a deep ConvNet branch plus two shallow branches operating on 4:1 (57x57) and 8:1 (29x29) subsampled inputs, each consisting of a single convolution followed by max pooling. The branch outputs are l2-normalized and combined by a linear embedding into a 4096-dimensional vector, which is l2-normalized again.]

TRAINING DATA

• ImageNet for pre-training: category-level information.
• Relevance training data: fine-grained visual information.
  - Golden Feature: good for visual similarity but not for semantic similarity, and expensive to compute.

OPTIMIZATION

• Asynchronous stochastic gradient algorithm.
• Momentum algorithm.
• Dropout to avoid overfitting.

Challenges:
• Cannot enumerate all triplets; important triplets must be sampled.
• Cannot load all images into memory; triplets must be generated online.

TRIPLET SAMPLING

Sampling criterion: more highly relevant images are sampled more often. The total relevance score r_i is

    r_i = sum_{j: c_j = c_i, j != i} r_{i,j}    (4)

• Query image: sampled according to its total relevance score.
• Positive image: sampled from images with the same label as the query, with probability P(p_i^+) = min{T_p, r_{i,i^+}} / Z_i.
• Negative image, two types of samples:
  1. In-class negatives: drawn with the same distribution as positive images, with the additional requirement that the margin between the relevance scores r_{i,i^+} and r_{i,i^-} be larger than T_r.
  2. Out-of-class negatives: drawn uniformly from all images in different categories.

Online triplet sampling is implemented with reservoir sampling.

[Figure: online triplet sampling — each incoming image sample is routed to the buffer of its query; triplets (query, positive, negative) are then drawn from these per-query buffers.]

EXPERIMENTS

Comparison with hand-crafted features:

Method              Precision   Score-30
Wavelet             62.2%       2735
Color               62.3%       2935
SIFT-like           65.5%       2863
Fisher              67.2%       3064
HOG                 68.4%       3099
SPMKtexton1024max   66.5%       3556
L1HashKPCA          76.2%       6356
OASIS               79.2%       6813
Golden Features     80.3%       7165
DeepRanking         85.7%       7004

Comparison of different architectures:

Method                          Precision   Score-30
ConvNet                         82.8%       5772
Single-scale Ranking            84.6%       6245
OASIS on Single-scale Ranking   82.5%       6263
Single-Scale & Visual Feature   84.1%       6765
DeepRanking                     85.7%       7004

Comparison of different sampling methods:

[Figure: Score-30 (left) and overall precision (right) as a function of the fraction of out-of-class negative samples, comparing weighted and uniform sampling.]

RANKING EXAMPLES

[Figure: example queries with ranking results from ConvNet, OASIS, and Deep Ranking.]

ACKNOWLEDGMENT

This work was done while the first author was an intern at Google.

DATA

High-quality image triplet evaluation dataset, available at https://sites.google.com/site/imagesimilaritydata/
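The weighted positive-sampling rule from the TRIPLET SAMPLING section, P(p_i^+) = min{T_p, r_{i,i^+}} / Z_i, can be sketched as follows. This is an illustrative Python sketch under stated assumptions: the helper names are hypothetical, and the cap T_p and relevance scores are supplied by the caller:

```python
import random

def positive_sampling_weights(rel_scores, t_p):
    # Weight for each in-class candidate: min(T_p, r_{i,i+}), normalized by Z_i.
    # The cap T_p keeps a single highly relevant image from dominating the draw.
    capped = [min(t_p, r) for r in rel_scores]
    z = sum(capped)  # Z_i, the normalization constant
    return [w / z for w in capped]

def sample_positive(candidates, rel_scores, t_p, rng=random):
    # Draw one positive image with probability min(T_p, r) / Z_i.
    weights = positive_sampling_weights(rel_scores, t_p)
    return rng.choices(candidates, weights=weights, k=1)[0]
```

In-class negatives would reuse the same distribution, with the extra filter that the relevance margin r_{i,i^+} - r_{i,i^-} exceeds T_r; out-of-class negatives are a uniform draw over other categories.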