
IMLG – 22 March 2018 John Collomosse 2

Visual Search?

Visual search: Querying visual repositories using visual (pictorial) queries.

80% of the Internet forecast to be visual data by end 2018 [Cisco, NF 2016] (on track for 84%)

iTrace – Visual Plagiarism Detection

A visual ‘TurnItIn’

Trialed at 4 HEIs

120k artworks in VADS.ac.uk + uploads


“Classic” Computer Vision

Matches visual features (“key points”) between images

SIFT features

(circa 2004)


“Modern” Computer Vision

Use deep neural networks trained on example data to extract digital signatures from whole images

Convolutional Neural Network (CNN)



Deep Learning Revolution

Pre 2012 – State of the art…. Deep Learning (~2016) – State of the art….

Q: What is this? (Which of 1000 objects is this?) A: Mug (~9% accuracy)

Q: How many leftover donuts are there? A: Three (~70% accuracy)



Deep Learning Revolution

Challenges for CNNs circa 2012:

- Data hungry. CNNs require a lot of training data.

- Processing power. CNNs require a lot of CPU to train. So, only simple CNNs were trained.

- Niche.

Then…

- ImageNet arrived (16m images, 1000 classes) [Deng et al. 2009]

- GPUs. General purpose GPU processing / CUDA. The algorithms for training a CNN are highly parallelisable.

- NIPS/ECCV 2012. Double-digit % gain on ImageNet accuracy announced using CNNs.

The vision community took notice!


Sketch based Visual Search


Sketch based Retrieval of…

1. Several Million (10^7) Colour Images

2. Images using Deeply Learned Descriptors

3. Sketching with Style: Search with Aesthetic Constraints



Why Sketch?

“Most of the next generation will probably never use Desktop products. People don’t understand how profound a shift this is. The reality is for these hundreds of millions of users, mobile will be their entire gateway to services.” – Wired, 2017

• Touch screen (gesture) is the primary interface on mobile (replacing text/keyboard)
• New discovery tools needed to release value in visual content
• Sketch is an intuitive modality for describing desired visual attributes


(want to invest time to)

But the problem…


“People don’t draw well!”

Sketching is visual communication

[1] Hu and Collomosse “Performance Evaluation of Gradient Field HOG” Comp. Vision. Image Understanding (CVIU) 2013.

Excerpt of Flickr15k [1]

Humans communicate efficiently, using vocabulary & context

Sketch for retrieval is a casual throw-away act (for a machine).

(… and some users are bad at sketching)


Demo


Android demo app available for phones/tablets at: https://play.google.com/store/apps/details?id=com.collomosse.sketcher


Diversion into Text Search


A common measure of text document similarity involves building a frequency histogram of the words in the document: a “bag of words”

[Figure: word-frequency histogram with bins labelled "Martians", "eve", "the", "molten", "Life", "Light", "and", "by"]

This descriptor encodes the distribution of words in the document; a function of its content

Careful choice of the words (bins) is key! Location of words doesn’t matter!
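The bag-of-words idea above can be sketched in a few lines of Python. This is a minimal illustration, not the speaker's code: the vocabulary, the three toy documents, and the choice of cosine similarity as the comparison measure are all assumptions made for the example.

```python
import re
from collections import Counter
from math import sqrt


def bag_of_words(text, vocabulary):
    """Count how often each vocabulary word (bin) occurs in the text."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(words)
    return [counts[word] for word in vocabulary]


def cosine_similarity(h1, h2):
    """Compare two histograms; 1.0 means identical word distributions."""
    dot = sum(a * b for a, b in zip(h1, h2))
    norm = sqrt(sum(a * a for a in h1)) * sqrt(sum(b * b for b in h2))
    return dot / norm if norm else 0.0


# Hypothetical vocabulary and documents, chosen only for illustration.
vocabulary = ["martians", "life", "light", "molten"]
doc_a = "The Martians saw light. Light and life."
doc_b = "Life and light, light for the Martians."
doc_c = "Molten rock, molten metal."

hist_a = bag_of_words(doc_a, vocabulary)
hist_b = bag_of_words(doc_b, vocabulary)
hist_c = bag_of_words(doc_c, vocabulary)

print(hist_a, hist_b, hist_c)
print(cosine_similarity(hist_a, hist_b), cosine_similarity(hist_a, hist_c))
```

Documents A and B use the same words in a different order and get the same histogram, which is the "location of words doesn't matter" point. Document C shares no bins with A, so its similarity to A is zero.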


Bag of Visual Words for Photo Retrieval


Q. Why does BoVW find a swan?

A. The frequency / distribution of local texture patches cut from the query (SIFT) matches those cut from swan images in the database.
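A rough sketch of that matching step, under stated assumptions: each local patch descriptor is assigned to its nearest "visual word" in a codebook, and the resulting frequency histograms are compared. The 2-D descriptors, the three-word codebook, and the use of histogram intersection are hypothetical stand-ins. Real SIFT descriptors are 128-D, and a codebook is normally learned by clustering.

```python
def squared_distance(u, v):
    return sum((x - y) ** 2 for x, y in zip(u, v))


def nearest_codeword(descriptor, codebook):
    """Index of the visual word closest to this patch descriptor."""
    return min(range(len(codebook)),
               key=lambda k: squared_distance(descriptor, codebook[k]))


def bovw_histogram(descriptors, codebook):
    """Normalised frequency histogram of visual words in one image."""
    hist = [0.0] * len(codebook)
    for descriptor in descriptors:
        hist[nearest_codeword(descriptor, codebook)] += 1.0
    total = sum(hist)
    return [h / total for h in hist] if total else hist


def histogram_intersection(h1, h2):
    """Similarity in [0, 1] for normalised histograms."""
    return sum(min(a, b) for a, b in zip(h1, h2))


# Toy data (assumed): three visual words and descriptors for three images.
codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
query = [(0.1, 0.0), (0.9, 0.1), (1.0, 0.0), (0.0, 0.1)]
swan = [(0.0, 0.1), (1.0, 0.1), (0.9, 0.0), (0.1, 0.1)]
other = [(0.0, 0.9), (0.1, 1.0), (0.0, 1.0), (0.1, 0.8)]

query_hist = bovw_histogram(query, codebook)
swan_score = histogram_intersection(query_hist, bovw_histogram(swan, codebook))
other_score = histogram_intersection(query_hist, bovw_histogram(other, codebook))
print(swan_score, other_score)
```

The swan image scores higher because its patches fall into the same visual-word bins as the query's, regardless of where in the image the patches were cut.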

query

database


Bag of Visual Words for Sketch Retrieval


Q. Why do we see a swan?

A. The spatial relationships of strokes (edges) determine the object’s structure, from which we infer presence of a swan.


Synthesising “Texture”


Photos Sketches


Feature Extraction (sketch & photo)

Photographs passed through edge detection filter

Multi-scale patches cut at every ‘edge’ or ‘sketch’ pixel (Gradient information / HOG)
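To make "gradient information / HOG" concrete, here is a simplified orientation histogram for a single patch. It is a plain HOG-style cell written for illustration, not the Gradient Field HOG descriptor of [1]. The bin count, the central-difference gradient, and the toy 4x4 patch are assumptions.

```python
import math


def orientation_histogram(patch, bins=8):
    """Magnitude-weighted histogram of unsigned gradient orientations."""
    rows, cols = len(patch), len(patch[0])
    hist = [0.0] * bins
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            # Central differences approximate the image gradient.
            gx = patch[r][c + 1] - patch[r][c - 1]
            gy = patch[r + 1][c] - patch[r - 1][c]
            magnitude = math.hypot(gx, gy)
            if magnitude == 0.0:
                continue
            # Fold the direction into [0, pi): edge polarity is ignored.
            angle = math.atan2(gy, gx) % math.pi
            index = min(int(angle / math.pi * bins), bins - 1)
            hist[index] += magnitude
    total = sum(hist)
    return [h / total for h in hist] if total else hist


# Toy patch (assumed): a vertical edge, dark on the left, bright on the right.
patch = [[0, 0, 1, 1] for _ in range(4)]
hist = orientation_histogram(patch)
print(hist)
```

For the vertical edge every gradient points horizontally, so all of the weight lands in the first orientation bin.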

[Diagram: Database images -> Edge detection -> Feature extraction -> Feature encoding -> Index file. Query sketch -> Feature extraction -> Feature encoding -> Matching (against the index file) -> results.]


Results: ImageNet

Dataset: 16m images.

Query time: ~2s / query

Platform: AMD 2.6 GHz, single core benchmark

43 GB features









Problems with an Edgemap Approach

Sketches are not edgemaps (Distortion, Level of Abstraction, etc.)


House

Crocodile

TU-Berlin dataset, Eitz et al. 2012


Cross domain metric learning for SBIR

Can we learn a low dimensional metric embedding of edge and sketch space?


Before learning After learning

sketch image


Sketch matching with CNN


Triplet loss in triplet network:

\[ L(a, p, n) = \frac{1}{2} \max\left(0,\; m + \lVert a - p \rVert_2^2 - \lVert a - n \rVert_2^2\right) \]

m: margin

[Diagram: three network branches with identical layer labels, taking a, p and n as input and feeding L(a,p,n). Layer labels recoverable for each branch: 64 (71 x 71, 5 x 5), 128 (31 x 31, 3 x 3), 256 (15 x 15, 3 x 3), 256 (15 x 15, 3 x 3), 256 (15 x 15), 512, 512, 100.]
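As a worked example, the triplet loss on this slide, half of max(0, m + ||a - p||^2 - ||a - n||^2), can be evaluated directly on small vectors. The 2-D embeddings and the margin value m = 1.0 below are assumptions chosen for illustration. In the network, a, p and n would be the branch outputs.

```python
def squared_l2(u, v):
    """Squared Euclidean distance between two embeddings."""
    return sum((x - y) ** 2 for x, y in zip(u, v))


def triplet_loss(a, p, n, m=1.0):
    """Half of max(0, m + ||a - p||^2 - ||a - n||^2)."""
    return 0.5 * max(0.0, m + squared_l2(a, p) - squared_l2(a, n))


# Toy embeddings (assumed): the positive sits close to the anchor.
anchor = [0.0, 0.0]
positive = [0.1, 0.0]
negative_far = [2.0, 0.0]   # already further away than the margin requires
negative_near = [0.5, 0.0]  # too close to the anchor, so it is penalised

loss_far = triplet_loss(anchor, positive, negative_far)
loss_near = triplet_loss(anchor, positive, negative_near)
print(loss_far, loss_near)
```

The loss is zero once the negative is further from the anchor than the positive by at least the margin, so only triplets that violate the margin contribute to training.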


What happens during training?

Learning joint embedding of edge and sketch space


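[Figure: anchor a, positive p and negative n in the embedding space, shown twice: before training, and with the margin m marked.]

Triplet loss:

\[ L(a, p, n) = \frac{1}{2} \max\left(0,\; m + \lVert a - p \rVert_2^2 - \lVert a - n \rVert_2^2\right) \]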


Datasets


Training:
• Sketch: TU-Berlin, 20k sketches @ 250 classes.
• Image: Internet photo acquisition

Test:
• Flickr15k: 15k photos + 330 sketches @ 33 classes.

Class diversity between training and test datasets:

TU-Berlin Flickr15k

“bridge” “bridge” “Tower bridge”

“Sydney bridge”

“Oxford bridge”

“duck” “swan” “duck-swan” “duck-swan”

“sun” “moon” “moon” “sunrise-sunset”

Human-skeleton nose mermaid angel


Training Methodology

Data Augmentation and Triplet Formation


• Images:
  - 25k photos: 100 photos/class.
  - Edge extraction: gPb [Arbelaez, 2011].
  - Mean subtraction, random crop/rotation/scaling/flip.

• Sketches:
  - 20k sketches: 20s training, 60s validation per class.
  - Skeletonisation.
  - Mean subtraction, random crop/rotation/scaling/flip.
  - Random stroke removal.

• Triplet formation:
  - Random selection pos/neg samples.

• Training:
  - 10k epochs.
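Random triplet formation could look roughly like the sketch below. The slide only says that positive and negative samples are selected at random. Using a sketch as the anchor with images as positive and negative, the `form_triplets` helper, and the toy item names are all assumptions made for illustration.

```python
import random


def form_triplets(sketches, images, count, rng):
    """Randomly pick (anchor sketch, same-class image, other-class image).

    sketches and images map a class label to a list of item identifiers.
    At least two classes must be present in both dictionaries.
    """
    shared = sorted(set(sketches) & set(images))
    triplets = []
    for _ in range(count):
        pos_class, neg_class = rng.sample(shared, 2)  # two distinct classes
        anchor = rng.choice(sketches[pos_class])
        positive = rng.choice(images[pos_class])
        negative = rng.choice(images[neg_class])
        triplets.append((anchor, positive, negative))
    return triplets


# Toy data (assumed): three classes with three items each.
labels = ("swan", "bridge", "moon")
sketches = {c: [f"{c}_sketch_{i}" for i in range(3)] for c in labels}
images = {c: [f"{c}_photo_{i}" for i in range(3)] for c in labels}

triplets = form_triplets(sketches, images, count=5, rng=random.Random(0))
for triplet in triplets:
    print(triplet)
```

Passing in a seeded `random.Random` keeps the selection reproducible, which helps when comparing training runs.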

[Figure: augmentation examples: crop, rotation, scaling, flip, stroke removal]
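Random stroke removal can be sketched as follows, assuming a sketch is stored as a list of strokes and each stroke is a list of (x, y) points. The representation, the drop probability, and the rule of always keeping at least one stroke are assumptions for illustration. The slides do not specify them.

```python
import random


def remove_random_strokes(strokes, drop_probability, rng):
    """Return a copy of the sketch with each stroke dropped independently."""
    if not strokes:
        return []
    kept = [s for s in strokes if rng.random() >= drop_probability]
    # Never return an empty sketch: fall back to one randomly chosen stroke.
    return kept if kept else [rng.choice(strokes)]


# Toy sketch (assumed): four strokes, each a short polyline.
strokes = [
    [(0, 0), (1, 1)],
    [(1, 1), (2, 0)],
    [(2, 0), (3, 3)],
    [(0, 3), (3, 3)],
]

augmented = remove_random_strokes(strokes, 0.25, random.Random(0))
print(len(strokes), len(augmented))
```

Each call yields a slightly sparser variant of the same drawing, which is one way to expose the network to incomplete sketches during training.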


Representative Queries/Results









Sketching with Style

1

2

1. User’s intermediate work in Photoshop (here, a graphite sketch)

2. Behance visually searched for inspiration in a specified style (here, watercolor)

Result searching 66.8m Behance images


Video Demo


Learning the Style Embedding

GoogleNet (Inception v3) with 128-D Bottleneck

Triplet design, fully siamese

Training Set (110k Behance)


Visualizing the Style Embedding (Behance 1m test set)

t-SNE perplexity 20


Putting it all together

Two Stream Network Architecture:

1. A Structure Network – that learns an embedding to visually match structure irrespective of style

2. A Style Network – that learns an embedding to visually match aesthetics irrespective of content (structure)
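One way to picture a search over both embeddings is to rank database items by a weighted sum of their structure distance and style distance to the query. This is a hypothetical fusion rule, not necessarily the one used in the talk. The `rank` function, the `style_weight` parameter, the index entries, and the 2-D toy embeddings are all assumptions.

```python
def squared_l2(u, v):
    return sum((x - y) ** 2 for x, y in zip(u, v))


def rank(query_structure, query_style, index, style_weight=1.0):
    """Order index entries by combined structure and style distance."""
    def score(entry):
        structure_term = squared_l2(query_structure, entry["structure"])
        style_term = squared_l2(query_style, entry["style"])
        return structure_term + style_weight * style_term
    return [entry["name"] for entry in sorted(index, key=score)]


# Toy index (assumed): structure axis 0 = swan, axis 1 = bridge;
# style axis 0 = graphite, axis 1 = watercolor.
index = [
    {"name": "bridge_watercolor", "structure": [0.0, 1.0], "style": [0.0, 1.0]},
    {"name": "swan_graphite", "structure": [1.0, 0.0], "style": [1.0, 0.0]},
    {"name": "swan_watercolor", "structure": [1.0, 0.0], "style": [0.0, 1.0]},
]

# Query: a swan-like sketch with a watercolor-like style exemplar.
results = rank([0.9, 0.1], [0.1, 0.9], index, style_weight=0.5)
print(results)
```

Lowering `style_weight` makes the ranking favour matching structure over matching aesthetics, and raising it does the opposite.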

[Diagram: boxes labelled "Structure network: Sketch", "Structure network: Image" and "Style network" (twice) produce a structure embedding and a style embedding, which feed the search index. Size labels recoverable from the diagram: 256 and 128.]


Evaluating Sketch+Style Retrieval (top-1 result)


Fine Grain: Style Analogies

Vector math in 128-D style space

= (watercolor + graphite) = (watercolor – graphite)



= comic = (comic – pen+ink)



= vectorart = (– vectorart)
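A hedged sketch of what vector math in a style space can look like: style vectors are added or subtracted, and the result is used as a nearest-neighbour query. The 3-D vectors stand in for the 128-D style embedding, and the style names, database entries, and helper functions are hypothetical.

```python
def add(u, v):
    return [x + y for x, y in zip(u, v)]


def subtract(u, v):
    return [x - y for x, y in zip(u, v)]


def squared_l2(u, v):
    return sum((x - y) ** 2 for x, y in zip(u, v))


def nearest(query, named_vectors):
    """Name of the database entry whose style vector is closest to query."""
    return min(named_vectors, key=lambda name: squared_l2(query, named_vectors[name]))


# Toy style vectors (assumed): one axis per style for readability.
watercolor = [1.0, 0.0, 0.0]
graphite = [0.0, 1.0, 0.0]

database = {
    "wash_over_pencil": [0.9, 0.8, 0.0],
    "pure_watercolor": [1.0, 0.0, 0.0],
    "pure_comic": [0.0, 0.0, 1.0],
}

blend_match = nearest(add(watercolor, graphite), database)
difference_match = nearest(subtract(watercolor, graphite), database)
print(blend_match, difference_match)
```

Adding the two style vectors retrieves the item that mixes both styles, while subtracting graphite steers the query toward watercolor with no pencil component.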


Closing Thoughts

Scalability

- Scaling under sketch ambiguity is the challenge (not compute)
- Need to integrate modalities beyond shape. Which? How to fuse?
- How to determine user intent in prioritizing modalities?

Composition Breakdown

- All SBIR assumes a single dominant object, but real data isn’t like that


Deep Learning

- Deep Learning outperforms classic approaches at perceptual tasks like search
- But it must be trained with a lot of representative data (+annotation)
- Sketch data in particular is sparse (10^2 categories, 10^4 instances)

