
INTERNATIONAL ORGANISATION FOR STANDARDISATION
ORGANISATION INTERNATIONALE DE NORMALISATION
ISO/IEC JTC1/SC29/WG11
CODING OF MOVING PICTURES AND AUDIO

ISO/IEC JTC1/SC29/WG11 MPEG2011/m19188
January 2011, Daegu, Korea

Source: Peking University, Harbin Institute of Technology, China
Status: Input Contribution
Title: Peking University Landmarks: A Context Aware Visual Search Benchmark Database
Authors: Lingyu Duan ([email protected]), Rongrong Ji ([email protected]), Jie Chen ([email protected]), Shuang Yang ([email protected]), Tiejun Huang ([email protected]), Hongxun Yao ([email protected]), Wen Gao ([email protected])

1 Introduction 

The 93rd MPEG meeting produced draft requirements documents (w11529, w11530 and w11531) for Compact Descriptors for Visual Search. To advance this work, this contribution presents our work on establishing a context aware visual search benchmark database for mobile landmark search. In the input contribution m18542 at the 94th MPEG Meeting [9], Peking University proposed a compact descriptor for visual search, which combines location cues to learn a discriminative and compact visual descriptor well suited for mobile landmark search. We believe our practice as well as the benchmark dataset would enhance the use cases and help identify requirements for Compact Descriptors for Visual Search.

While there has been ever growing focus on mobile visual search in recent years, a comprehensive benchmark database for fair evaluation among different strategies is still missing. In particular, the rich contextual cues in mobile devices, such as GPS information and camera parameters, are left unexploited in the current visual search benchmarks. This contribution introduces the Peking University Landmarks benchmark for the quantitative evaluation of mobile visual search performance with the support of GPS information. It contains 13,179 images organized into 198 distinct landmark locations within the Peking University campus, built by 20 volunteers during November and December 2010. Each location is captured with multiple shot sizes and viewing angles, using both digital cameras and phone cameras, and each photo is tagged with rich contextual information from the mobile scenario. Moreover, this benchmark covers typical quality degeneration scenarios in mobile photographing, including variable resolutions, blurring, lighting changes, occlusions, as well as various viewing angles. Together with this benchmark, we provide bag-of-visual-words search baselines in the cases of using either spatial or contextual information in the returned image ranking. Finally, distractor images are further introduced to evaluate the robustness of visual search methods on the database.

2 Motivation 

Coming with the explosive growth of phone cameras, mobile visual search has received increasing interest in computer vision, multimedia analysis, and information retrieval communities. However, state-of-the-art works are rarely compared with each other over a well-established benchmark database, which should be designed to target real-world mobile visual search scenarios that involve lots of photographing variances using phone cameras. In addition, the rich contextual cues, such as GPS, time stamp, and base station information, are extremely beneficial to refine the solely visual ranking. However, the effectiveness and robustness of such cues are left unexploited in the existing visual search benchmarks.

We believe a real-world, context rich benchmark with sufficient coverage of users' photographing variances is important to put forward mobile visual search research and applications. In this contribution, we introduce the Peking University Landmarks benchmark to evaluate GPS context assisted mobile visual search performance. The dataset is collected from 198 landmark locations within the Peking University campus. Our benchmark provides sufficient real-world photographing variances, typically for mobile phone cameras. We put more focus on the availability of contextual cues to improve the visual search performance, with a systematic methodology to evaluate the robustness of contextual cues by adding context distractors.

3 Benchmark Database Statistics 

Scale and Constitution: The Peking University Landmarks benchmark (PKUBench) contains 13,179 scene photos, organized into 198 landmark locations and captured via both digital and phone cameras. There are in total 6,193 photos captured from digital cameras (SONY DSC-W290, Samsung Techwin <Digimax S830 / Kenox S830>, Canon DIGITAL IXUS 100 IS, NIKON COOLPIX L12 and Canon IXUS 210 with resolution 2592×1944) and 6,986 photos from mobile phone cameras (Nokia E72-1, HTC Desire, Nokia 5235, Apple iPhone, Apple iPhone 3G and LG Electronics KP500 with resolutions 640×480, 1600×1200 and 2048×1536) respectively. We recruited over 20 volunteers for data acquisition; each landmark is captured by a pair of volunteers, one using a digital camera and the other using a mobile phone, carrying a portable GPS device (HOLUX M-1200E) with them. The averaged viewing angle variations between the digital and phone camera photographers are within 10 degrees for both volunteers. Note that blurring and shaking happen more frequently in phone capturing; in such cases, the volunteers compensate a bad photo with a new one, which thus produces more mobile phone photos than digital camera photos. All the images in the entire database were collected during November and December, 2010.

Fig. 1. Two typical scenarios of capturing landmark photos in different shot sizes and angles.

As illustrated in Figure 1, we capture photos in three different shot sizes, namely long shot, medium shot and close up. For each shot size, there are at most 8 directions in photographing, which attempt to cover 360 degrees from the frontal view of the landmark, captured every 45 degrees respectively. The capturing of both digital camera and mobile phone photos undergoes different weathers (sunny, cloudy, etc.) during November and December. The photo distributions with respect to landmark locations are given in Figure 2, where different colors denote the sample images of different landmarks. The percentage of mobile and camera photos is given in Figure 3.

Fig. 2. The landmark photo distribution, obtained by overlaying the location point of each collected photo on the Google Map of the Peking University campus.

Fig. 3. The percentages of both phone camera and digital camera photos in PKUBench.

Contextual Cues: Compared with generalized visual search, the mobile visual search scenario is closely related to the rich contextual information on the mobile phone. For instance, a mobile user's geographical location can be leveraged to pre-filter most of the unrelated scenes without visual ranking.


Over PKUBench, we pay particular attention to the use of such contextual cues in facilitating visual search, including: (1) GPS tag (both latitude and longitude); (2) landmark name label; (3) shot size (long, medium, and close-up) and viewpoint (frontal, side, and others) of each photo; (4) camera type (digital camera or mobile phone camera); (5) capture time stamp. We also provide EXIF information: camera settings (focal length, resolution).
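
To make these contextual cues concrete, the sketch below shows one possible per-photo record; this is an illustrative layout, not the official distribution format, and all field names and example values are our assumptions.

```python
# A hypothetical per-photo record carrying the contextual cues listed above;
# field names and the example values are illustrative only.
from dataclasses import dataclass

@dataclass
class PhotoRecord:
    photo_id: str            # e.g. a relative file path
    latitude: float          # GPS tag, decimal degrees
    longitude: float         # GPS tag, decimal degrees
    landmark: str            # landmark name label
    shot_size: str           # "long" | "medium" | "close-up"
    viewpoint: str           # "frontal" | "side" | "other"
    camera_type: str         # "digital" | "phone"
    timestamp: str           # capture time stamp
    focal_length_mm: float   # from EXIF
    resolution: tuple        # (width, height) from EXIF

# Example instance (values are made up for illustration):
rec = PhotoRecord("boya_tower/0001.jpg", 39.9927, 116.3055, "BoYa Tower",
                  "long", "frontal", "phone", "2010-11-21 14:05:32",
                  5.4, (2048, 1536))
```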

In addition, we will show the performance improvement brought by contextual information, by providing baselines that leverage GPS to refine the visual ranking. Furthermore, the effects of less precise contextual information are also investigated by adding distractor images and imposing random GPS distortion on the original GPS location of an image.
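
A minimal sketch of such a GPS distortion is given below; the noise model (a uniform offset within a radius) and the 200 m maximum are our assumptions, since the contribution does not specify the magnitude used.

```python
# Hypothetical GPS distortion for distractor images: perturb an original
# GPS tag by a random offset of up to max_radius_m metres.
import math
import random

def distort_gps(lat, lon, max_radius_m=200.0):
    radius = random.uniform(0.0, max_radius_m)
    bearing = random.uniform(0.0, 2.0 * math.pi)
    dlat = (radius * math.cos(bearing)) / 111320.0  # metres per degree of latitude
    dlon = (radius * math.sin(bearing)) / (111320.0 * math.cos(math.radians(lat)))
    return lat + dlat, lon + dlon

# Example: tag a distractor with a distorted copy of a database photo's GPS.
noisy_lat, noisy_lon = distort_gps(39.9927, 116.3055)
```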

Scene Diversity: We provide as diverse landmark appearances as possible to simulate the real-world difficulty of visual search. Hence, the volunteers were encouraged to capture both the queries and the ground truth photos (for both digital and phone cameras) without any particular intent to avoid intruding foreground objects, e.g. cars, human faces, and tree occlusions.

4 Comparing with Related Benchmarks 

ZuBuD Database [2] is widely adopted to evaluate vision-based geographical location recognition. It contains 1,005 color images of 201 buildings or scenes (5 images per building or scene) in Zurich, Switzerland.

Oxford Buildings Database [3] contains 5,062 images collected from Flickr by searching for particular Oxford landmarks, with manually annotated ground truth for 11 different landmarks, each represented by 5 possible queries.

SCity Database [4] contains 20,000 street-side photos for mobile visual search validation in the Microsoft Photo2Search system [4]. It was captured automatically along the main streets of Seattle by a car equipped with six surrounding cameras and a GPS device. The location of each captured photo is obtained by aligning the time stamps of the photos with the GPS record.

UKBench Database [5] contains 10,000 images of 2,500 objects, covering indoor objects like CD covers, book sets, etc. There are four images per object to offer sufficient variances in viewpoints, rotations, lighting conditions, scales, occlusions, and affine transforms.

Stanford Mobile Visual Search Data Set [6] contains camera-phone images of products, CDs, books, outdoor landmarks, business cards, text documents, museum paintings and video clips. It provides several unique characteristics, e.g. varying lighting conditions, perspective distortion, and mobile phone queries.

Table 1. Brief comparison of related benchmarking databases.

Database                              PKUBench   Zubud   Oxford   SCity    UKBench   Stanford
Data Scale                            13,179     1,005   5,062    20,000   10,000    -
Images Per Landmark/Object Category   66         5       92       6        4         -
Mobile Capture                        √          ×       ×        ×        ×         √
Categorized Shot Size, View Angle,
  Landmark/Object Scale               √          ×       ×        ×        Indoor    ×
Blurring Query                        √          ×       ×        ×        ×         ×
Context                               √          ×       ×        ×        ×         ×

PKUBench Database: Our database provides rich query scenarios in the following aspects: (1) rich contextual information to simulate what we can get from mobile phones; (2) low quality cellphone queries with comparison to the corresponding queries of the digital camera, to quantize the performance degeneration of cellphone queries; (3) occlusions in both queries and database caused by cars, people, trees, and nearby buildings, as well as blurring and shaking. Table 1 presents the brief comparison of related benchmarking databases in the state-of-the-art.

5 Exemplar Mobile Query Scenarios 

Five groups of exemplar mobile query scenarios (in total 168 queries) are demonstrated to evaluate the real-world visual search performance in challenging situations (see Figure 4):

Occlusive Query Set contains 20 mobile queries and 20 corresponding digital camera queries, occluded by foreground cars, people, and buildings.

Background Clutters Query Set contains 20 mobile queries and 20 digital camera queries. These are often captured far away from a landmark, where GPS based search would yield worse results due to the bias of other nearby buildings.

Night Query Set contains 9 mobile phone queries and 9 digital camera queries. The photo quality heavily depends on the lighting conditions.

Blurring and Shaking Query Set contains 20 mobile queries with blurring or shaking and 20 corresponding mobile queries without any blurring or shaking.

Adding Distractors into Database is to evaluate the effects of applying less precise contextual information to visual search. We collect a distractor set of 6,630 photos from Summer Palace (note: the landmark buildings in Summer Palace are visually similar to those in Peking University) and 2,012 photos from PKU, then randomly assign them the GPS tagging of the original database. We then select 10 locations (30 queries) from PKU to evaluate the mAP degeneration.

Fig.4. Examples of query scenarios (Digital Camera versus Mobile Phone) (From Top to Bottom: Occlusive, Background Clutters, Blurring/Shaking, and Night).

Landmark Scale: We categorize the landmark scale by measuring the range of the walking distances of the photographers around each assigned landmark location. We come up with three scales: small, medium, and large. The typical distance for the small scale is 0-12 m, the medium scale is 12-30 m, and the large scale is over 30 m. As shown in Figure 5, we have 63 small ones, 75 medium ones, and 60 large ones.

Table 2. Typical landmark types of the three different landmark scales

Small Scale (0-12m):   Sculptures, stones, pavilions, gates and others.
Medium Scale (12-30m): Courtyards, and small or medium sized buildings, such as office buildings, historic buildings (smaller floor area).
Large Scale (> 30m):   Large buildings, such as a library or a complex building, or a long shot of a very large object (e.g. BoYa Tower).

(Examples of Different Scales: From Top to Down: Small, Medium, and Large)

Fig. 5. The photo volumes of the three different landmark scales in PKUBench.

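The scale assignment follows directly from the walking-distance thresholds above; a small illustrative sketch (the function name is ours, not part of the benchmark tools):

```python
# Illustrative sketch of the 12 m / 30 m walking-distance thresholds used
# above to assign each landmark one of the three scale categories.
def landmark_scale(walking_distance_m: float) -> str:
    if walking_distance_m <= 12.0:
        return "small"    # sculptures, stones, pavilions, gates, ...
    elif walking_distance_m <= 30.0:
        return "medium"   # courtyards, small or medium sized buildings
    else:
        return "large"    # libraries, complex buildings, long shots

assert landmark_scale(8.0) == "small" and landmark_scale(45.0) == "large"
```
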
Finally, we provide more photographing details in Figure 6 and Figure 7.

Fig.6. Photo volume distribution by different shot sizes.

Fig.7. Photo volume distribution by different viewing angles.

6 Mobile Visual Search Baselines 

We provide several visual search baselines, including purely visual search as well as context assisted visual search:

(1) BoW: We extract SIFT [7] features from each photo, the ensemble of which is used to build a Scalable Vocabulary Tree [5] to generate the initial vocabulary V. The SVT generates a bag-of-words signature Vi for each database photo Ii. We denote the hierarchical level as H and the branching factor as B. In a typical setting, we have H = 6 and B = 10, producing approximately 100,000 codewords. We use mean Average Precision at N (mAP@N) to evaluate the search performance, which reveals the position-sensitive ranking precision over the top N returned results.

(2) GPS + BoW: We further leverage the location context to refine the visual ranking function by multiplying the GPS distance with the BoW distance to the query example, based on the weighting function:

    Dis(A, Q) = GeoDis(A, Q) × BoWDis(A, Q)        (1)

where Dis(A, Q) is the overall distance between query Q and database image A; GeoDis(A, Q) and BoWDis(A, Q) stand for the geographical distance (measured by GPS distance) and the BoW based visual distance between query Q and database image A, respectively. Our ranking is based on the similarity measurement in Equation (1).

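A minimal sketch of the ranking rule in Equation (1) follows. The haversine formula for GeoDis and the L1 distance between normalized bag-of-words signatures for BoWDis are our assumptions for illustration; the contribution only states that the two distances are multiplied.

```python
# Sketch of Equation (1): Dis(A, Q) = GeoDis(A, Q) * BoWDis(A, Q).
# Database images are then ranked by ascending Dis.
import math
import numpy as np

def geo_dis(lat1, lon1, lat2, lon2):
    """Great-circle (haversine) distance in metres between two GPS tags."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2.0 * 6371000.0 * math.asin(math.sqrt(a))

def bow_dis(vq, va):
    """L1 distance between L1-normalized bag-of-words signatures (assumed metric)."""
    vq = vq / max(vq.sum(), 1e-9)
    va = va / max(va.sum(), 1e-9)
    return float(np.abs(vq - va).sum())

def dis(query, db_image):
    """Overall distance of Equation (1) for one query/database image pair."""
    g = geo_dis(query["lat"], query["lon"], db_image["lat"], db_image["lon"])
    return g * bow_dis(query["bow"], db_image["bow"])
```
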
It is worth mentioning that we have discovered that RANSAC based spatial re-ranking, while it typically gives promising results in traditional visual experiments, does not produce satisfactory performance on this database. There are two possible reasons: (1) PKUBench usually contains lots of trees that are irregular for spatial re-ranking; (2) there are lots of similar buildings (such as ancient Chinese buildings) that have very similar local features, which cannot be well distinguished by RANSAC.

mAP Performance in Different Challenging Scenarios: We discuss the performance of each query scenario respectively as follows:

Fig.8. The performance of occlusive queries with respect to different methods, using digital camera and mobile phone camera (Y axis: mAP@N performance; X axis: top N returned results).

Note that most of the occlusive queries come from large scale landmarks, as occlusion often happens in the long or medium shot of a large scale landmark. In such cases, the GPS position tends to favor other nearby landmarks around the query location, which would lead to performance degeneration when using solely GPS information. This may even degenerate the visual search performance when combining visual search with GPS information.

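The per-scenario curves in Figs. 8-16 report mAP@N. A minimal sketch of one common way to compute it is given below; the exact normalization used in the contribution is not spelled out, so this is an assumption.

```python
# Sketch of mAP@N: average precision over the top N returned results,
# averaged across queries (normalization by min(|relevant|, N) is assumed).
def average_precision_at_n(ranked_ids, relevant_ids, n):
    """AP@N for one query; ranked_ids is the returned list, relevant_ids a set."""
    hits, precision_sum = 0, 0.0
    for i, image_id in enumerate(ranked_ids[:n], start=1):
        if image_id in relevant_ids:
            hits += 1
            precision_sum += hits / i  # precision at each relevant rank
    denom = min(len(relevant_ids), n)
    return precision_sum / denom if denom else 0.0

def mean_ap_at_n(runs, n):
    """runs: list of (ranked_ids, relevant_ids) pairs, one per query."""
    return sum(average_precision_at_n(r, g, n) for r, g in runs) / len(runs)
```
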
Fig.9. The performance of background clutters queries with respect to different methods.

In practice, the background clutter typically happens in capturing small scale landmarks. In such cases, the purely visual search performs worse, because in most queries the major part of the query photo is actually occupied by backgrounds.

Fig.10. The performance of Night queries with respect to different methods.

The Night query is an interesting case, where GPS (contextual information) dominates the location recognition performance. Extracting distinguishing local features is very difficult at night, which is quite different from the daytime. Hence, we can observe that using solely GPS is almost already enough. It is worth mentioning that, due to better image capturing quality, using a digital camera at night can achieve better visual search performance than a mobile phone.

Fig. 11. The performance of blurring and shaking queries of phone cameras.

From Fig.11, we find that introducing blurring and shaking would definitely degenerate the visual search performance. However, by incorporating GPS into the similarity ranking, the results become much more acceptable compared with the pure visual query results.

Overall Performance Comparison: We further show the overall performance (168 queries of exemplar scenarios) with respect to using either digital cameras or mobile phone cameras in Figure 12, which gives an intuitive finding about the mAP difference between digital camera and phone camera.

Note that using solely visual search, the performance of camera photos is better than using mobile phone photos; but with the combination of GPS, the performances of using either camera or mobile phone are almost identical.

Fig.12. Overall performance comparison between using camera and mobile phones.

Figure 13 further compares the performance over the whole database (one image as the query, the rest as the searched dataset) among different landmark scales. It is worth mentioning that the visual search performance of large scale landmarks is much better than that of the medium and small scales, due to fewer background clutters and more distinguishing interest points. The GPS based search performance of the small scale is better than the large scale, as the GPS signal may be distorted around a larger scale landmark. Moreover, as GPS plays a relatively important role, the small-scale landmarks yield better results when fusing visual search and GPS information.

Fig. 13. Performance comparison among different scales of landmarks.

Finally, we investigate the overall performance of in total 570 queries (including the above 168 queries), as shown in Figure 14. Undoubtedly, the best results come from fusing both GPS and visual search together. Although distractor images typically degenerate the performance of pure visual search, these degeneration effects are alleviated by integrating GPS with visual cues.

Fig. 14. Overall performance of 570 queries in Peking University Landmarks.

mAP Performance with respect to Different Search Baselines: Furthermore, from Figures 15-16, over those 168 queries, it is quite obvious that by adding contextual information the performance can be more or less improved, while different mobile query scenarios present diverse performances. Generally speaking, the worst performances originate from the queries of both blurring/Night and adding distractor images. The former indicates that the use of visual interest points would be challenged by mobile blurring queries. By comparing the results of adding distractors in Fig. 15 and 16, we can see that the use of contextual information should be taken seriously, while the simple combination is not robust enough for dealing with distractors.

Fig.15. Solely BoW performance comparisons in five typical query scenarios.

Fig.16. GPS+BoW performance comparisons in five typical query scenarios.


7 Application Scenarios 

We briefly describe possible application scenarios of our Peking University Landmarks database as follows:

A benchmark dataset for mobile visual search: We hope the Peking University Landmarks could become a useful resource to validate mobile visual search systems. It emphasizes two important factors in mobile visual search: query quality and contextual cues. To the best of our knowledge, both are beyond the state-of-the-art benchmark databases. In addition, it offers a dataset to evaluate the effectiveness and robustness of contextual information.

A benchmark dataset for location recognition: This dataset can be used to evaluate traditional location recognition systems, since a GPS location is bound to each image instance.

A training resource for scene modeling: This dataset may facilitate scene analysis and modeling, since our photographing is well designed to cover multi-shot, multi-view appearances of landmarks at multiple scales. To this end, we will provide the camera calibration information in our future work.

A training resource to learn better photographing manners: Our landmark photo collection can be further exploited to learn the (recommended) mobile photographing manners (proper angle and shot size for different types of landmarks) towards better visual search results.

8 References 

[1] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and F.-F. Li. ImageNet: A Large-Scale Hierarchical Image Database. CVPR, 2009.
[2] H. Shao, T. Svoboda, and L. Van Gool. ZuBuD - Zurich Buildings Database for Image Based Recognition. Technical Report, Computer Vision Lab, Swiss Federal Institute of Technology, 2006.
[3] J. Philbin, O. Chum, M. Isard, J. Sivic, and A. Zisserman. Object Retrieval with Large Vocabularies and Fast Spatial Matching. CVPR, 2007.
[4] R. Ji, X. Xie, H. Yao, and W.-Y. Ma. Hierarchical Optimization of Visual Vocabulary for Effective and Transferable Retrieval. CVPR, 2009.
[5] D. Nister and H. Stewenius. Scalable Recognition with a Vocabulary Tree. CVPR, 2006.
[6] S. Tsai, D. Chen, G. Takacs, V. Chandrasekhar, J. Singh, and B. Girod. Location Coding for Mobile Image Retrieval. Proc. 5th International Mobile Multimedia Communications Conference (MobiMedia), 2009.
[7] D. G. Lowe. Distinctive Image Features from Scale-Invariant Keypoints. IJCV, 2004.
[8] M. Fischler and R. C. Bolles. Random Sample Consensus: A Paradigm for Model Fitting with Applications to Image Analysis and Automated Cartography. Comm. of the ACM, 24: 381-395, 1981.
[9] R. Ji, L. Duan, T. Huang, H. Yao, and W. Gao. Compact Descriptors for Visual Search - Location Discriminative Mobile Landmark Search. CDVS Ad Hoc Group, Input Contribution m18542, 94th MPEG Meeting, Oct. 2010.