Tree structured CRF models for interactive image labeling
Thomas Mensink 1,2, Gabriela Csurka 1, Jakob Verbeek 2
1 Xerox Research Centre Europe, Grenoble, France
2 INRIA, Grenoble, France
To appear at CVPR 2011
Outline
1. Introduction
2. Structured image annotation models
3. Label Elicitation
4. Experimental Evaluation
5. Attribute-based image classification
Interactive Image labeling
Sky, Tree, Building, Sea, Plant, Ground, Rock, Person, Windows, Sand, Water.
Ask the user: Building (false), Rock (true), Sea (true), ...
Update the ranked list of keywords based on this information
Introduction - 1
Image labeling problem, a.k.a. classification, annotation, attribute prediction, ...
Used for e.g. keyword-based retrieval, indexing, clustering, ...
State of the art: train binary SVMs per label using fancy features (SIFT, BoW, Fisher kernels, spatial pyramids, ...)
Problem 1: this ignores structure in the output, i.e. correlation between labels (e.g. car & indoor).
Problem 2: how to incorporate user input?
Introduction - 2
How to obtain a (tractable) structure?
How to learn the parameters of this structure?
How to select labels to ask the user?
How does it perform?
Tree Structures
[Figure: example tree over object labels, with nodes such as airplane, armchair, awning, bag, ..., wall, water, window.]
Nodes are (class/category/attribute) labels.
Learn weights between nodes to encode co-occurrence.
Exact inference in a tree structure is tractable (using belief propagation, BP).
Inference is used for learning, label prediction and label elicitation.
Tree structured model on image labels
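Each node represents a label in the tree.
Vector of (binary) labels: $y = \{y_1, \ldots, y_L\}$.
The $L-1$ edges are (somehow) given: $\mathcal{E} = \{e_1, \ldots, e_{L-1}\}$.

$$E(y, x) = \sum_{i=1}^{L} \psi_i(y_i, x) + \sum_{(i,j) \in \mathcal{E}} \psi_{ij}(y_i, y_j), \tag{1}$$

$$p(y | x) = \frac{1}{Z(x)} \exp\big(-E(y, x)\big), \tag{2}$$

$$Z(x) = \sum_{y \in \{0,1\}^L} \exp\big(-E(y, x)\big). \tag{3}$$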
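As a rough illustration of equations (1) to (3), the sketch below evaluates the energy and normalizes by brute-force enumeration of all $2^L$ labelings. This is only practical for a handful of labels; the slides rely on belief propagation instead. The function names, the table layout of the potentials and all numbers are made up for the example.

```python
import itertools
import math

def energy(y, unary, pairwise, edges):
    """E(y, x) = sum_i psi_i(y_i, x) + sum_{(i,j) in edges} psi_ij(y_i, y_j)."""
    e = sum(unary[i][y[i]] for i in range(len(y)))
    e += sum(pairwise[(i, j)][y[i]][y[j]] for (i, j) in edges)
    return e

def posterior(unary, pairwise, edges):
    """Return {y: p(y | x)} by enumerating all 2^L binary labelings."""
    L = len(unary)
    weights = {y: math.exp(-energy(y, unary, pairwise, edges))
               for y in itertools.product((0, 1), repeat=L)}
    Z = sum(weights.values())  # partition function Z(x)
    return {y: w / Z for y, w in weights.items()}

# Toy example: L = 3 labels on the chain 0 - 1 - 2 (hypothetical potentials).
unary = [[0.0, 1.0], [0.5, 0.0], [0.0, 0.2]]  # unary[i][l] = psi_i(y_i = l, x)
edges = [(0, 1), (1, 2)]
pairwise = {
    (0, 1): [[0.0, 1.0], [1.0, 0.0]],  # pairwise[(i, j)][s][t] = psi_ij(s, t)
    (1, 2): [[0.0, 0.5], [0.5, 0.0]],
}
p = posterior(unary, pairwise, edges)
```

Lower energy means higher probability, so in this toy model the labeling (0, 0, 0) is more likely than (1, 1, 1).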
Unary Potentials
$$E(y, x) = \underbrace{\sum_{i=1}^{L} \psi_i(y_i, x)}_{\text{Unary Potentials}} + \sum_{(i,j) \in \mathcal{E}} \psi_{ij}(y_i, y_j)$$

$y_i$ is a label: Rock, Sea, City, People, ...

$$\psi_i(y_i = l, x) = [\phi_i(x), 1]^\top w_i^l$$

$\phi_i(x)$: pre-trained SVM score for label $i$.
Pairwise Potentials
$$E(y, x) = \sum_{i=1}^{L} \psi_i(y_i, x) + \underbrace{\sum_{(i,j) \in \mathcal{E}} \psi_{ij}(y_i, y_j)}_{\text{Pairwise Potentials}}$$

Example: $y_i$ = Sand and $y_j$ = City.
Independent of the image input:

$$\psi_{ij}(y_i = s, y_j = t) = v_{ij}^{st}$$
Defining the Tree
Finding the optimal tree structure for conditional models is intractable.
For generative models, use the Chow-Liu algorithm:
• Fully connected graph
• Edge weight = mutual information
• Maximum spanning tree
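The three Chow-Liu steps can be sketched as follows, assuming the training labels are available as a binary matrix of shape (N, L). The function names and the toy data are hypothetical, and Prim's algorithm is just one convenient way to obtain the maximum spanning tree.

```python
import numpy as np

def mutual_information(a, b):
    """Empirical mutual information (in nats) between two binary vectors."""
    mi = 0.0
    for s in (0, 1):
        for t in (0, 1):
            p_st = np.mean((a == s) & (b == t))
            p_s, p_t = np.mean(a == s), np.mean(b == t)
            if p_st > 0:  # terms with zero joint probability contribute nothing
                mi += p_st * np.log(p_st / (p_s * p_t))
    return mi

def chow_liu_edges(Y):
    """Y: (N, L) binary label matrix. Returns the L-1 edges of a maximum
    spanning tree of the complete graph weighted by mutual information."""
    L = Y.shape[1]
    mi = np.zeros((L, L))
    for i in range(L):
        for j in range(i + 1, L):
            mi[i, j] = mi[j, i] = mutual_information(Y[:, i], Y[:, j])
    in_tree, edges = {0}, []
    while len(in_tree) < L:  # Prim's algorithm, maximizing the edge weight
        i, j = max(((i, j) for i in in_tree for j in range(L) if j not in in_tree),
                   key=lambda e: mi[e])
        edges.append((i, j))
        in_tree.add(j)
    return edges

# Toy data: label 1 is a copy of label 0, label 2 is independent noise.
rng = np.random.default_rng(0)
a = rng.integers(0, 2, size=500)
c = rng.integers(0, 2, size=500)
Y = np.stack([a, a, c], axis=1)
edges = chow_liu_edges(Y)
```

In this toy data the strongly dependent pair (0, 1) is connected first, and the independent label 2 is attached with a weak edge.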
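Learning
Learn w and v in the unary and pairwise potentials.
Using the log-likelihood (concave):

$$\mathcal{L} = \sum_{n=1}^{N} \mathcal{L}_n = \sum_{n=1}^{N} \ln p(y_n | x_n).$$

Gradients:

$$\frac{\partial \mathcal{L}_n}{\partial w_i^l} = \big( p(y_i = l | x_n) - [\![ y_{in} = l ]\!] \big)\, \phi_i(x_n), \tag{4}$$

$$\frac{\partial \mathcal{L}_n}{\partial v_{ij}^{st}} = p(y_i = s, y_j = t | x_n) - [\![ y_{in} = s, y_{jn} = t ]\!]. \tag{5}$$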
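For readers who want the intermediate step, the pairwise gradient (5) can be derived from equations (1) to (3) as sketched below. This is a standard log-linear model derivation written out under the sign convention $p(y|x) \propto \exp(-E(y,x))$ used on the slides; the unary gradient (4) follows in the same way.

```latex
\begin{align*}
\mathcal{L}_n &= \ln p(y_n | x_n) = -E(y_n, x_n) - \ln Z(x_n), \\
\frac{\partial E(y, x_n)}{\partial v_{ij}^{st}} &= [\![ y_i = s, y_j = t ]\!], \\
\frac{\partial \ln Z(x_n)}{\partial v_{ij}^{st}}
  &= \frac{1}{Z(x_n)} \sum_{y \in \{0,1\}^L} \exp\big(-E(y, x_n)\big)
     \big( -[\![ y_i = s, y_j = t ]\!] \big)
   = -p(y_i = s, y_j = t | x_n), \\
\frac{\partial \mathcal{L}_n}{\partial v_{ij}^{st}}
  &= p(y_i = s, y_j = t | x_n) - [\![ y_{in} = s, y_{jn} = t ]\!].
\end{align*}
```

The gradient vanishes when the model's pairwise marginals match the empirical co-occurrence indicators, which is why the marginals computed by inference are needed during learning.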
Trees over groups of labels
[Figure: tree over groups of labels, with concepts such as Outdoor, Day, No Visual Time, Single Person, ..., Airplane, Rain, Horse, Fish.]
To allow more dependencies between labels:
• A node is a group of k fully connected labels.
• Every state is modeled explicitly, so a node has $2^k$ states.
To define a tree structure:
• Agglomerative clustering of labels,
• Chow-Liu algorithm on these clusters.
Compound Node

State   Marginal   Landscape/Nature   Sky   Clouds
1        3.4 %     0                  0     0
2        0.0 %     0                  0     1
3        9.8 %     0                  1     0
4       59.9 %     0                  1     1
5        0.4 %     1                  0     0
6        0.0 %     1                  0     1
7        2.6 %     1                  1     0
8       23.9 %     1                  1     1

Marginal on label = true:   Landscape/Nature 26.9 %,  Sky 96.2 %,  Clouds 83.8 %

BP gives us node marginals, from which we read off the label marginals $p(y_i | x)$.
Message passing: $O(2^{2k})$.
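Reading off label marginals from a compound node amounts to summing the node marginal over all states in which the label is on. The short sketch below redoes this for the three-label node in the table; the variable names are chosen for the example only.

```python
import itertools

# Node marginals from the table (in percent), with states ordered as
# (Landscape/Nature, Sky, Clouds) = 000, 001, 010, ..., 111.
state_marginals = [3.4, 0.0, 9.8, 59.9, 0.4, 0.0, 2.6, 23.9]
states = list(itertools.product((0, 1), repeat=3))
labels = ["Landscape/Nature", "Sky", "Clouds"]

# p(y_i = 1 | x): sum the node marginal over all states where label i is on.
label_marginals = {
    name: sum(p for state, p in zip(states, state_marginals) if state[k] == 1)
    for k, name in enumerate(labels)
}
```

Up to floating-point rounding, the sums reproduce the last row of the table.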
Label Elicitation
SUN 09 - 5 labels
Questions asked: Building, Tree, Sea, Rocks, Rock.

Rank   Before     After
01     Sky        Rock
02     Tree       Rocks
03     Building   Sea
04     Sea        Sky
05     Rocks      Sand
06     Plant      Ground
07     Ground     Plant
08     Rock       Person
09     Person     Window
10     Window     Water
Label Elicitation
Interactive setting: ask the user at test time to set some of the many labels of a single example.
Active learning: ask the user at train time for the class label of some of the many examples.
Label Elicitation
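Select a label $i$ such that the expected uncertainty in the remaining labels is minimized:

$$H(y_{\setminus i} | y_i, x) = \sum_{l} p(y_i = l | x)\, H(y_{\setminus i} | y_i = l, x).$$

Entropy identity:

$$H(y | x) = H(y_i | x) + H(y_{\setminus i} | y_i, x)$$

Since $H(y | x)$ does not depend on $i$, this is equivalent to selecting the label $i$ with the highest entropy $H(y_i | x)$.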
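A minimal sketch of this selection rule, assuming the label marginals $p(y_i = 1 | x)$ have already been computed by inference. The function names and the marginal values are invented for the example.

```python
import math

def binary_entropy(p):
    """Entropy (in nats) of a binary variable with P(y = 1) = p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log(p) + (1 - p) * math.log(1 - p))

def select_label(marginals, asked=()):
    """Pick the not-yet-asked label whose marginal has maximal entropy."""
    candidates = {name: p for name, p in marginals.items() if name not in asked}
    return max(candidates, key=lambda name: binary_entropy(candidates[name]))

# Hypothetical marginals p(y_i = 1 | x) for one image.
marginals = {"Sky": 0.96, "Sea": 0.55, "Building": 0.30, "Person": 0.05}
first = select_label(marginals)
second = select_label(marginals, asked={"Sea"})
```

In the interactive setting the marginals would be recomputed after each answer, so that the user input propagates through the tree before the next question is chosen; the second call here skips that step for brevity.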
Databases
Table: Basic statistics of the three data sets.

                      ImageCLEF   SUN'09   Animals w.A.
# Train images        6400        4367     24295
# Test images         1600        4317     6180
# Labels              93          107      85
Train img/label       833         219      8812
Train label/img       12.1        5.34     30.8
Nr of parameters for trees with group size:
  k = 1               ± 740       852      676
  k = 2               ± 1284      1480     1172
  k = 3               ± 2912      3340     2644
  k = 4               ± 7508      8640     6836

Performance evaluated using:
• MAP: retrieval performance per label,
• iMAP: annotation performance per image.
Results 1
[Figure: bar plots of MAP and iMAP on ImageCLEF'10, SUN09 and AwA for the models I, k1, k2, k3, k4 and M.]
Results 2
[Figure: MAP and iMAP as a function of the number of questions (0 to 80) for Indep-Rand, Indep-Ent, Mixt-Rand and Mixt-Ent.]
Interactive image annotation performance as a function of the amount of user input, ImageCLEF dataset.
Attribute-based image classification
Attribute-based image classification - 2
Predict attributes with our tree-structured models.
Deterministic mapping between attributes and classes.

$$p(z = c | x) = \frac{p(y_c | x)}{\sum_{c'=1}^{C} p(y_{c'} | x)} = \frac{\exp\big(-E(y_c, x)\big)}{\sum_{c'=1}^{C} \exp\big(-E(y_{c'}, x)\big)}. \tag{6}$$

Note: this does not require belief propagation; it suffices to evaluate $E(y_c, x)$ for the $C$ attribute configurations.
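Equation (6) is a softmax over the negated energies of the $C$ class-specific attribute configurations. A small sketch, assuming the energies $E(y_c, x)$ have already been evaluated; the function name and the energy values are hypothetical.

```python
import math

def class_posterior(energies):
    """p(z = c | x) proportional to exp(-E(y_c, x)), normalized over C classes."""
    e_min = min(energies)  # shift for numerical stability; cancels in the ratio
    weights = [math.exp(-(e - e_min)) for e in energies]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical energies E(y_c, x) for C = 3 classes.
energies = [2.0, 0.5, 3.0]
class_post = class_posterior(energies)
```

The partition function $Z(x)$ of equation (3) cancels in the ratio, which is why only the $C$ energies are needed.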
Correction Term
Observation: some classes are over-predicted:

$$p(z = c | x) \propto \exp\big(-E(y_c, x) - u_c\big), \tag{7}$$

            40 train / 10 test classes    50 train / 50 test classes
            Independent   Mixture         Independent   Mixture
Class Acc   35.84         14.12           36.20         36.10
Class Acc   36.53         38.72           40.06         43.54
Label Elicitation for classification
Label Elicitation on Attribute Level
Goal: minimize the uncertainty on the class label.
Any informative question rules out at least one class.
This (again) results in selecting the attribute $i$ with the highest entropy $H(y_i | x)$, but $p(y_i | x)$ is defined differently:

$$p(y_i = 1 | x) = \sum_{c} p(z = c | x)\, y_{ic}, \tag{8}$$
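Equation (8) is a matrix-vector product between the class posterior and the binary class-attribute matrix. The sketch below computes the attribute marginals and picks the attribute with the highest entropy; the function names, the toy posterior and the attribute matrix are assumptions made for the example.

```python
import numpy as np

def attribute_marginals(class_post, class_attributes):
    """p(y_i = 1 | x) = sum_c p(z = c | x) * y_ic.
    class_post: (C,) class posterior; class_attributes: (C, A) binary matrix."""
    return np.asarray(class_post) @ np.asarray(class_attributes)

def most_uncertain_attribute(class_post, class_attributes):
    """Index of the attribute whose marginal has the highest binary entropy."""
    p = np.clip(attribute_marginals(class_post, class_attributes), 1e-12, 1 - 1e-12)
    entropy = -(p * np.log(p) + (1 - p) * np.log(1 - p))
    return int(np.argmax(entropy))

# Hypothetical posterior over C = 3 classes and their A = 3 attributes.
class_post = [0.5, 0.3, 0.2]
class_attributes = [[1, 0, 1],
                    [1, 1, 0],
                    [1, 0, 0]]
marg = attribute_marginals(class_post, class_attributes)
question = most_uncertain_attribute(class_post, class_attributes)
```

Attribute 0 is shared by all classes, so its marginal is 1 and asking about it would rule out no class; attribute 2 splits the posterior mass evenly and is selected.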
Results Classification
        Init   1      2      3      4      5      6      7      8
Indep   36.5   53.1   68.5   77.8   85.1   90.6   94.5   97.7   99.4
Mixt    38.7   55.3   72.3   84.8   92.4   96.9   99.0   99.8   100.0

Classification accuracy of the independent and mixture-of-trees models: initial results, and after user input for one up to eight selected attributes.
Conclusions
Tree-structured CRF models for interactive
• Image annotation, and
• Attribute-based classification
Improves moderately over independent models.
Real power in the interactive setting: (i) propagate user input, (ii) ask more informative questions.
Tree structured CRF models for interactive image labeling
Questions?!?