SPARSE KERNEL LEARNING FOR IMAGE ANNOTATION
Sean Moran†, Victor Lavrenko†
sean.moran@ed.ac.uk

RESEARCH QUESTION

• How do we exploit multiple features for image annotation?

INTRODUCTION

• Problem: Assigning one or more keywords to describe an image.
• Probabilistic generative modelling approach:
  – Compute the conditional probability of a word given an image, P(w|I).
  – Take the 5 words with the highest P(w|I) as the image annotation.

[Figure: annotation pipeline. Multiple features (GIST, SIFT, LAB, HAAR) are extracted from the training dataset of annotated images (e.g. "Tiger, Grass, Whiskers"; "City, Castle, Smoke"; "Eagle, Sky"). For a testing image, the annotation model produces a ranked list of words (e.g. P(Tiger|I) = 0.15, P(Grass|I) = 0.12, P(Whiskers|I) = 0.12, P(Sky|I) = 0.08, P(City|I) = 0.03) and the top 5 words are taken as the annotation. This talk: how best to combine features?]

• Advantages:
  – Permits image search based on natural language keywords.
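The annotation step described above (rank candidate words by P(w|I) and keep the top 5) can be sketched as follows; the probability values are illustrative only, not real model output:

```python
# Sketch of the annotation step: rank words by P(w|I), keep the top n.
# The scores below are illustrative values, not output of a trained model.

def annotate(word_probs, n=5):
    """Return the n words with the highest conditional probability P(w|I)."""
    ranked = sorted(word_probs.items(), key=lambda kv: -kv[1])
    return [word for word, _ in ranked[:n]]

scores = {"tiger": 0.15, "grass": 0.12, "whiskers": 0.12,
          "leaves": 0.10, "tree": 0.10, "sky": 0.08, "city": 0.03}
print(annotate(scores))  # the five highest-probability words
```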
CONTINUOUS RELEVANCE MODEL (CRM): LAVRENKO ET AL. '03

• P(w, f): the joint expectation of words w and image features f, defined over images J in the training set T:

  P(w, \vec{f}) = \sum_{J \in T} P(J) \prod_{i=1}^{K} P(w_i \mid J) \prod_{i=1}^{M} P(\vec{f}_i \mid J)   (1)

• P(w_i|J) is modelled using a Dirichlet prior:

  P(w_i \mid J) = \frac{\mu p_v + N_{v,J}}{\mu + \sum_{v'} N_{v',J}}   (2)

• N_{v,J}: number of times the word v appears in the annotation of training image J; p_v: relative frequency of word v; μ: smoothing parameter.
• P(\vec{f}_i|J) is modelled with a kernel-based density estimator:

  P(\vec{f}_i \mid J) = \frac{1}{R} \sum_{j=1}^{R} P(\vec{f}_i \mid \vec{f}_j)   (3)

• Each region j = 1...R instantiates a Gaussian kernel with bandwidth β:

  P(\vec{f}_i \mid \vec{f}_j) = \frac{1}{\sqrt{2^d \pi^d \beta}} \exp\left(-\frac{\|\vec{f}_i - \vec{f}_j\|^2}{\beta}\right)   (4)

SPARSE KERNEL LEARNING CRM (SKL-CRM)

• Extend the CRM to M feature types (e.g. SIFT, HSV, RGB, ...):

  P(I \mid J) = \prod_{i=1}^{M} \sum_{j=1}^{R} \exp\left(-\frac{1}{\beta} \sum_{u,v} \Psi_{u,v}\, k_v(\vec{f}^{\,u}_i, \vec{f}^{\,u}_j)\right)   (5)

• SKL-CRM learns Ψ_{u,v}: an alignment matrix mapping a kernel k_v (e.g. Gaussian) to a feature type u (e.g. SIFT).

GREEDY KERNEL-FEATURE ALIGNMENT ALGORITHM

• Greedily solve for the kernel-feature alignment matrix Ψ_{u,v}.
• At each iteration, add the kernel-feature pair that maximises F1.

[Figure: four greedy iterations building the optimal alignment matrix Ψ over features (RGB, SIFT, LAB, HSV) and kernels (Gaussian, Laplacian, Uniform), reaching F1 = 0.42 after 4 iterations; insets plot the generalised Gaussian density GG(x; p) for p = 1, 2 and 15.]

• SKL-CRM aligns to the generalised Gaussian, χ², Hellinger and Multinomial kernels.

QUANTITATIVE RESULTS

• Mean per-word recall (R), precision (P), F1 measure, and number of words with recall > 0 (N+):
  Dataset   |    Corel 5K       |    IAPR TC12      |    ESP Game
            | R   P   F1   N+  | R   P   F1   N+  | R   P   F1   N+
  CRM       | 19  16  17   107 | –   –   –    –   | –   –   –    –
  JEC       | 32  27  29   139 | 29  28  28   250 | 25  22  23   224
  RF-opt    | 40  29  34   157 | 31  44  36   253 | 26  41  32   235
  GS        | 33  30  31   146 | 29  32  30   252 | –   –   –    –
  KSVM-VT   | 42  32  36   179 | 29  47  36   268 | 32  33  33   259
  Tagprop   | 42  33  37   160 | 35  46  40   266 | 27  39  32   239
  SKL-CRM   | 46  39  42   184 | 32  47  38   274 | 26  41  32   248

• Rapid convergence to a sparse subset of the available features:

[Figure: % of maximum F1 versus feature count (0-15) on Corel 5K, ESP Game and IAPR TC12, showing a rapid path to maximum F1 using minimal features.]

SUMMARY OF KEY FINDINGS

• Better to choose kernels based on the data than to opt for the default assignments advocated in the literature.
• Only a small number of carefully chosen features is required for the best annotation performance.
• See our ICMR'14 paper for further information and results.
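The greedy kernel-feature alignment algorithm above can be sketched as follows. This is a minimal sketch, not the authors' implementation: `evaluate_f1` is a hypothetical stand-in for training the SKL-CRM with a candidate alignment and measuring annotation F1 on held-out data, and the one-kernel-per-feature constraint reflects the alignment matrix shown in the figure:

```python
# Sketch of the greedy kernel-feature alignment. "evaluate_f1" is a
# hypothetical placeholder supplied by the caller: it would train/validate
# the model under a candidate alignment and return the validation F1.

def greedy_alignment(features, kernels, evaluate_f1, max_iters=4):
    """Greedily build a sparse alignment (feature -> kernel), adding at each
    iteration the single (feature, kernel) pair that most improves F1."""
    alignment = {}   # sparse representation of the matrix Psi
    best_f1 = 0.0
    for _ in range(max_iters):
        best_pair = None
        for u in features:
            if u in alignment:        # one kernel per feature type (assumed)
                continue
            for v in kernels:
                f1 = evaluate_f1({**alignment, u: v})
                if f1 > best_f1:
                    best_f1, best_pair = f1, (u, v)
        if best_pair is None:         # no pair improves F1 -> stop early
            break
        alignment[best_pair[0]] = best_pair[1]
    return alignment, best_f1
```

The early-stopping behaviour is consistent with the rapid convergence reported above: once additional kernel-feature pairs stop improving validation F1, the alignment stays sparse.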