
Tree structured CRF models for interactive image labeling

Thomas Mensink1,2 Gabriela Csurka1 Jakob Verbeek2

1Xerox Research Centre Europe, Grenoble, France

2INRIA, Grenoble, France

To appear at CVPR 2011


Outline

1. Introduction

2. Structured image annotation models

3. Label Elicitation

4. Experimental Evaluation

5. Attribute-based image classification


Interactive Image labeling

Sky, Tree, Building, Sea, Plant, Ground, Rock, Person, Windows, Sand, Water.

Ask the user: Building (false), Rock (true), Sea (true), ...

Update the ranked list of keywords based on this information


Introduction - 1

Image labeling problem, a.k.a. classification, annotation, attribute prediction, ...
Used for e.g. keyword-based retrieval, indexing, clustering, ...
State of the art: train binary SVMs per label using fancy features (SIFT, BoW, Fisher kernels, spatial pyramids, ...)

Problem 1: it ignores structure in the output, i.e. correlation between labels (e.g. car & indoor).
Problem 2: how to incorporate user input?


Introduction - 2

How to obtain a (tractable) structure?
How to learn the parameters of this structure?
How to select labels to ask the user?
How does it perform?


Tree Structures

[Figure: tree structure over labels. Node labels: airplane, armchair, awning, bag, balcony, ball, bars, basket, bed, bench, bookcase, books, bottle, bottles, bowl, box, boxes, bread, building, bus, cabinet, candle, car, chair, chandelier, clock, closet, clothes, counter, countertop, cupboard, curtain, cushion, desk, dish, dishwasher, dome, door, drawer, easel, fence, field, fireplace, floor, flowers, gate, glass, grass, ground, handrail, headstone, machine, microwave, mirror, monitor, mountain, oven, path, person, picture, pillow, plant, plate, platform, poster, pot, railing, refrigerator, river, road, rock, rocks, rug, sand, screen, sea, seats, shelves, shoes, showcase, sink, sky, sofa, staircase, stand, steps, stone, stones, stool, stove, streetlight, table, television, text, toilet, towel, tower, tray, tree, truck, umbrella, van, vase, videos, wall, water, window.]

Nodes are (class/category/attribute) labels.
Learn weights between nodes to encode co-occurrence.
Exact inference in a tree structure is tractable (using BP).
Inference is used for learning, label prediction and label elicitation.


Tree structured model on image labels

Each node in the tree represents a label.
Vector of (binary) labels: $y = \{y_1, \ldots, y_L\}$.
The $L-1$ edges are (somehow) given: $\mathcal{E} = \{e_1, \ldots, e_{L-1}\}$.

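$$E(y, x) = \sum_{i=1}^{L} \psi_i(y_i, x) + \sum_{(i,j) \in \mathcal{E}} \psi_{ij}(y_i, y_j), \tag{1}$$

$$p(y | x) = \frac{1}{Z(x)} \exp\bigl(-E(y, x)\bigr), \tag{2}$$

$$Z(x) = \sum_{y \in \{0,1\}^L} \exp\bigl(-E(y, x)\bigr) \tag{3}$$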
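As an illustration of Eqs. (1)-(3), the following is a minimal sketch that evaluates the energy and the normalized posterior by brute-force enumeration. The three-label chain and all potential values are hypothetical, chosen only for the example. Enumeration costs $2^L$ evaluations and is only feasible for tiny $L$; the slides use belief propagation on the tree instead.

```python
import itertools
import math

def energy(y, unary, pairwise, edges):
    """E(y, x) of Eq. (1): unary terms plus pairwise terms over the tree edges."""
    e = sum(unary[i][y[i]] for i in range(len(y)))
    e += sum(pairwise[(i, j)][y[i]][y[j]] for (i, j) in edges)
    return e

def posterior(unary, pairwise, edges):
    """p(y | x) of Eq. (2), with Z(x) of Eq. (3) obtained by enumerating {0,1}^L."""
    n_labels = len(unary)
    states = list(itertools.product((0, 1), repeat=n_labels))
    weights = [math.exp(-energy(y, unary, pairwise, edges)) for y in states]
    z = sum(weights)
    return {y: w / z for y, w in zip(states, weights)}

# Hypothetical 3-label chain 0 - 1 - 2 (values made up for illustration).
unary = [[0.0, 1.0], [0.5, 0.0], [0.0, 0.2]]   # unary[i][y_i] = psi_i(y_i, x)
edges = [(0, 1), (1, 2)]
pairwise = {
    (0, 1): [[0.0, 1.0], [1.0, 0.0]],           # pairwise[(i, j)][y_i][y_j]
    (1, 2): [[0.0, 0.5], [0.5, 0.0]],
}
p = posterior(unary, pairwise, edges)
```

Lower energy means higher probability: here the configuration (0, 0, 0) has energy 0.5 and (1, 0, 0) has energy 2.5, so the former is the more probable of the two.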


Unary Potentials

$$E(y, x) = \sum_{i=1}^{L} \underbrace{\psi_i(y_i, x)}_{\text{unary potentials}} + \sum_{(i,j) \in \mathcal{E}} \psi_{ij}(y_i, y_j)$$

$y_i$ is a label: Rock, Sea, City, People, ...

$$\psi_i(y_i = l, x) = [\phi_i(x), 1]^\top w_i^l$$

$\phi_i(x)$: pre-trained SVM score for label $i$


Pairwise Potentials

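$$E(y, x) = \sum_{i=1}^{L} \psi_i(y_i, x) + \sum_{(i,j) \in \mathcal{E}} \underbrace{\psi_{ij}(y_i, y_j)}_{\text{pairwise potentials}}$$

Example: $y_i$ = Sand and $y_j$ = City.
Independent of the image input.

$$\psi_{ij}(y_i = s, y_j = t) = v_{ij}^{st}$$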


Defining the Tree

Finding the optimal tree structure for conditional models is intractable.
For generative models, use the Chow-Liu algorithm:

• Fully connected graph
• Edge weight = mutual information
• Maximum spanning tree
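The three Chow-Liu steps can be sketched as follows. This is a minimal illustration under assumptions: the mutual information is estimated from empirical co-occurrence counts of binary training labels, and the maximum spanning tree is found with Kruskal's algorithm. The function names and the toy label matrix are hypothetical and not taken from the slides.

```python
import math
from itertools import combinations

def mutual_information(a, b):
    """Empirical mutual information (in nats) between two binary label columns."""
    n = len(a)
    mi = 0.0
    for s in (0, 1):
        for t in (0, 1):
            p_st = sum(1 for u, v in zip(a, b) if u == s and v == t) / n
            p_s = sum(1 for u in a if u == s) / n
            p_t = sum(1 for v in b if v == t) / n
            if p_st > 0:
                mi += p_st * math.log(p_st / (p_s * p_t))
    return mi

def chow_liu_edges(label_rows):
    """Maximum spanning tree (Kruskal) over labels, weighted by mutual information.

    label_rows holds one binary label vector per training image.
    """
    n_labels = len(label_rows[0])
    cols = [[row[i] for row in label_rows] for i in range(n_labels)]
    weighted = sorted(
        ((mutual_information(cols[i], cols[j]), i, j)
         for i, j in combinations(range(n_labels), 2)),
        reverse=True,
    )
    parent = list(range(n_labels))   # union-find forest for cycle detection

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    tree = []
    for _, i, j in weighted:
        ri, rj = find(i), find(j)
        if ri != rj:                 # adding (i, j) does not close a cycle
            parent[ri] = rj
            tree.append((i, j))
    return tree

# Hypothetical toy data: labels 0 and 1 always co-occur, as do labels 2 and 3.
label_rows = [
    (0, 0, 0, 0), (0, 0, 1, 1), (1, 1, 0, 0),
    (1, 1, 1, 1), (1, 1, 1, 1), (0, 0, 0, 0),
]
tree = chow_liu_edges(label_rows)
```

On this toy data the strongly dependent pairs (0, 1) and (2, 3) end up as edges, and one weaker edge joins the two pairs into a single tree.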


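Learning

Learning $w$ and $v$ in the unary and pairwise potentials.
Using the log-likelihood (concave):

$$\mathcal{L} = \sum_{n=1}^{N} \mathcal{L}_n = \sum_{n=1}^{N} \ln p(y_n | x_n).$$

Gradients:

$$\frac{\partial \mathcal{L}_n}{\partial w_i^l} = \bigl( p(y_i = l | x_n) - [[y_{in} = l]] \bigr) \phi_i(x_n), \tag{4}$$

$$\frac{\partial \mathcal{L}_n}{\partial v_{ij}^{st}} = p(y_i = s, y_j = t | x_n) - [[y_{in} = s, y_{jn} = t]], \tag{5}$$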
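The pairwise gradient of Eq. (5) can be checked numerically on a tiny model. The sketch below assumes a hypothetical two-label model with a single edge and made-up potential values; it computes the pairwise marginal by enumeration (standing in for belief propagation) and compares the analytic gradient with a central finite difference of the log-likelihood.

```python
import itertools
import math

def energy(y, unary, v):
    """Energy of a two-label model with one edge: unary terms plus v[y_0][y_1]."""
    return unary[0][y[0]] + unary[1][y[1]] + v[y[0]][y[1]]

def log_lik(y_obs, unary, v):
    """ln p(y_obs | x) = -E(y_obs, x) - ln Z(x), with Z by enumeration."""
    log_z = math.log(sum(math.exp(-energy(y, unary, v))
                         for y in itertools.product((0, 1), repeat=2)))
    return -energy(y_obs, unary, v) - log_z

def grad_v(y_obs, unary, v, s, t):
    """Eq. (5): p(y_i = s, y_j = t | x) - [[y_i = s, y_j = t]]."""
    p_st = math.exp(log_lik((s, t), unary, v))   # pairwise marginal of the edge
    return p_st - (1.0 if y_obs == (s, t) else 0.0)

def finite_difference(y_obs, unary, v, s, t, eps=1e-6):
    """Central difference of the log-likelihood with respect to v[s][t]."""
    v_plus = [row[:] for row in v]
    v_minus = [row[:] for row in v]
    v_plus[s][t] += eps
    v_minus[s][t] -= eps
    return (log_lik(y_obs, unary, v_plus) - log_lik(y_obs, unary, v_minus)) / (2 * eps)

# Hypothetical potentials and observed labeling.
unary = [[0.0, 0.3], [0.2, 0.0]]
v = [[0.0, 0.7], [0.4, 0.1]]
y_obs = (1, 0)
max_gap = max(abs(grad_v(y_obs, unary, v, s, t)
                  - finite_difference(y_obs, unary, v, s, t))
              for s in (0, 1) for t in (0, 1))
```

The four gradient entries sum to zero, since the marginals and the indicator each sum to one.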


Trees over groups of labels

[Figure: tree over groups of labels. Labels shown: Outdoor, Day, No Visual Time, Single Person, No Persons, Male, Portrait, Female, Adult, Motion Blur, Partly Blurred, No Blur, Landscape Nature, Sky, Clouds, Plants, Flowers, Trees, Summer, Winter, No Visual Season, Animals, Dog, Bird, Overexposed, Underexposed, Neutral Illumination, Indoor, No Visual Place, Sunny, Water, River, Sea, Aesthetic Impression, Overall Quality, Fancy, Vehicle, Car, Ship, Visual Arts, Artificial, Natural, Family Friends, Small Group, Teenager, Still Life, Food, Toy, Citylife, Night, Street, Park Garden, Boring, Cute, Building Sights, Architecture, Church, Partylife, Big Group, Musical Instrument, Sunset Sunrise, Macro, Insect, Sports, Bicycle, Skateboard, Bridge, Travel, Train, Graffiti, Painting, Abstract, Desert, Lake, Mountains, Work, Technical, Old Person, Out of Focus, Shadow, Bodypart, Beach Holidays, Baby, Child, Snow, Spring, Autumn, Birthday, Cat, Airplane, Rain, Horse, Fish.]


Trees over groups of labels

To allow more dependencies between labels:
• A node is a group of $k$ fully connected labels.
• Every state is modeled explicitly, so a node has $2^k$ states.

To define a tree structure:
• Agglomerative clustering of labels,
• Chow-Liu algorithm on these clusters.


Compound Node

State   Marginal   Landscape/Nature   Sky   Clouds
1        3.4 %     0                  0     0
2        0.0 %     0                  0     1
3        9.8 %     0                  1     0
4       59.9 %     0                  1     1
5        0.4 %     1                  0     0
6        0.0 %     1                  0     1
7        2.6 %     1                  1     0
8       23.9 %     1                  1     1

Marginal on label = true: Landscape/Nature 26.9%, Sky 96.2%, Clouds 83.8%

BP gives us node marginals, from which we read off the label marginals $p(y_i | x)$.
Message passing: $O(2^{2k})$
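Reading off label marginals from a compound node amounts to summing the node marginal over all states in which the label is on. A minimal sketch using the eight state marginals listed for this node (variable names are illustrative):

```python
from itertools import product

labels = ("Landscape/Nature", "Sky", "Clouds")
# Node marginals in percent, for states 1..8 in the order listed for this node.
state_marginals = [3.4, 0.0, 9.8, 59.9, 0.4, 0.0, 2.6, 23.9]
# Joint states in the same order: (0,0,0), (0,0,1), (0,1,0), ..., (1,1,1).
states = list(product((0, 1), repeat=len(labels)))

# p(y_i = 1 | x): sum the node marginal over all states where label i is on.
label_marginals = {
    name: sum(p for state, p in zip(states, state_marginals) if state[k] == 1)
    for k, name in enumerate(labels)
}
```

Up to floating-point rounding this reproduces the 26.9%, 96.2% and 83.8% reported for the three labels.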


Label Elicitation

SUN 09 - 5 labels

Rank   Before     After
01     Sky        Rock
02     Tree       Rocks
03     Building   Sea
04     Sea        Sky
05     Rocks      Sand
06     Plant      Ground
07     Ground     Plant
08     Rock       Person
09     Person     Window
10     Window     Water

Questions: Building, Tree, Sea, Rocks, Rock


Label Elicitation

Interactive setting: ask the user at test time to set some of the many labels for a single example.

Active learning: ask the user at train time for the class label of some of many examples.


Label Elicitation

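Select the label $i$ such that the expected uncertainty in the remaining labels is minimized:

$$H(y_{\setminus i} | y_i, x) = \sum_{l} p(y_i = l | x) \, H(y_{\setminus i} | y_i = l, x).$$

Entropy identity:

$$H(y | x) = H(y_i | x) + H(y_{\setminus i} | y_i, x)$$

Since $H(y | x)$ does not depend on $i$, this is equivalent to selecting the label $i$ with the highest entropy $H(y_i | x)$.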
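For binary labels, the selection rule reduces to picking the label whose marginal is closest to 0.5. A minimal sketch, assuming the label marginals $p(y_i = 1 | x)$ have already been computed (for example by BP); the marginal values and function names are hypothetical.

```python
import math

def binary_entropy(p):
    """Entropy (in nats) of a binary label with p(y_i = 1 | x) = p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log(p) - (1.0 - p) * math.log(1.0 - p)

def select_label(marginals, asked=()):
    """Index of the not-yet-asked label with the highest entropy H(y_i | x)."""
    candidates = [i for i in range(len(marginals)) if i not in asked]
    return max(candidates, key=lambda i: binary_entropy(marginals[i]))

marginals = [0.95, 0.10, 0.55, 0.30]               # hypothetical p(y_i = 1 | x)
first = select_label(marginals)                    # most uncertain label
second = select_label(marginals, asked={first})    # next one, marginals unchanged
```

In the interactive setting the marginals would be recomputed after each answer, with the answered label clamped to the user's value, before the next question is selected; the second call above skips that step for brevity.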


Databases

Table: Basic statistics of the three data sets.

                       ImageCLEF   SUN’09   Animals w.A.
# Train images         6400        4367     24295
# Test images          1600        4317     6180
# Labels               93          107      85
Train img/label        833         219      8812
Train label/img        12.1        5.34     30.8
Nr of parameters for trees with group size:
  k = 1                ±740        852      676
  k = 2                ±1284       1480     1172
  k = 3                ±2912       3340     2644
  k = 4                ±7508       8640     6836

Performance evaluated using:
• MAP: retrieval performance per label,
• iMAP: annotation performance per image.
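The difference between the two measures is only the direction of ranking: MAP ranks images for each label, iMAP ranks labels for each image. The sketch below uses a common definition of average precision; the exact variant used in the evaluation is not stated on the slides, so this is an assumption, and the scores and ground truth are made up.

```python
def average_precision(scores, relevant):
    """Average precision of a ranking by decreasing score; relevant holds 0/1 flags."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    hits, total = 0, 0.0
    for rank, i in enumerate(order, start=1):
        if relevant[i]:
            hits += 1
            total += hits / rank     # precision at this relevant item
    return total / hits if hits else 0.0

def mean_ap(score_rows, truth_rows):
    """Mean of the per-row average precisions."""
    aps = [average_precision(s, y) for s, y in zip(score_rows, truth_rows)]
    return sum(aps) / len(aps)

def transpose(matrix):
    return [list(col) for col in zip(*matrix)]

# Hypothetical data: scores[n][i] is the score of label i for image n.
scores = [[0.9, 0.2, 0.6], [0.1, 0.8, 0.7], [0.4, 0.3, 0.65]]
truth = [[1, 0, 0], [0, 0, 1], [0, 1, 1]]

imap = mean_ap(scores, truth)                        # rank labels within each image
map_ = mean_ap(transpose(scores), transpose(truth))  # rank images within each label
```

On this toy data the two measures differ (iMAP = 7/9, MAP = 5/6), which shows that a model can rank well per label and less well per image, or the reverse.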


Results 1

[Figure: bar charts of MAP and iMAP on ImageCLEF’10, SUN09 and AwA, comparing the models I, k1, k2, k3, k4 and M.]


Results 2

[Figure: two plots, MAP and iMAP versus Nr Questions (0 to 80), with curves Indep-Rand, Indep-Ent, Mixt-Rand and Mixt-Ent.]

Interactive image annotation performance as a function of the amount of user input, ImageCLEF dataset.


Attribute-based image classification


Attribute-based image classification - 2

Predict attributes with our tree-structured models.

Deterministic mapping between attributes and classes.

$$p(z = c | x) = \frac{p(y_c | x)}{\sum_{c'=1}^{C} p(y_{c'} | x)} = \frac{\exp\bigl(-E(y_c, x)\bigr)}{\sum_{c'=1}^{C} \exp\bigl(-E(y_{c'}, x)\bigr)}. \tag{6}$$

Note: this does not require belief propagation; it suffices to evaluate $E(y_c, x)$ for the $C$ attribute configurations.
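Eq. (6) is a softmax over the negative energies of the $C$ class-specific attribute configurations. A minimal sketch, assuming the energies $E(y_c, x)$ have already been evaluated; the energy values are hypothetical, and subtracting the maximum is a standard numerical-stability step that is not mentioned on the slides.

```python
import math

def class_posterior(energies):
    """Eq. (6): p(z = c | x) proportional to exp(-E(y_c, x))."""
    logits = [-e for e in energies]
    m = max(logits)                        # subtract the max for numerical stability
    weights = [math.exp(l - m) for l in logits]
    z = sum(weights)
    return [w / z for w in weights]

energies = [2.0, 1.0, 3.0]                 # hypothetical E(y_c, x) for C = 3 classes
p_class = class_posterior(energies)
```

The partition function $Z(x)$ of Eq. (3) cancels in the ratio, which is why no inference over all $2^L$ attribute configurations is needed.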


Correction Term

Observation: some classes are over-predicted:

$$p(z = c | x) \propto \exp\bigl(-E(y_c, x) - u_c\bigr), \tag{7}$$

              40 train / 10 test classes    50 train / 50 test classes
              Independent    Mixture        Independent    Mixture
Class Acc     35.84          14.12          36.20          36.10
Class Acc     36.53          38.72          40.06          43.54


Label Elicitation for classification

Label Elicitation on Attribute Level

Goal: minimize the uncertainty on the class label.
Any informative question rules out at least 1 class.

This (again) results in selecting the attribute $i$ with the highest entropy $H(y_i | x)$.
But $p(y_i | x)$ is defined differently:

$$p(y_i = 1 | x) = \sum_{c} p(z = c | x) \, y_{ic}, \tag{8}$$
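One elicitation step on the attribute level can be sketched as follows. The class-attribute matrix and the class posterior are hypothetical. The update after an answer assumes, in line with the deterministic attribute-to-class mapping, that classes inconsistent with the answer are ruled out and the rest renormalized.

```python
import math

def attribute_marginal(class_post, class_attrs, i):
    """Eq. (8): p(y_i = 1 | x) = sum_c p(z = c | x) * y_ic."""
    return sum(p * attrs[i] for p, attrs in zip(class_post, class_attrs))

def entropy(p):
    """Binary entropy in nats."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log(p) - (1.0 - p) * math.log(1.0 - p)

def best_attribute(class_post, class_attrs):
    """Attribute with the highest entropy H(y_i | x) under Eq. (8)."""
    n_attr = len(class_attrs[0])
    return max(range(n_attr),
               key=lambda i: entropy(attribute_marginal(class_post, class_attrs, i)))

def condition_on_answer(class_post, class_attrs, i, value):
    """Rule out classes with y_ic != value, then renormalize (assumed update rule)."""
    kept = [p if attrs[i] == value else 0.0
            for p, attrs in zip(class_post, class_attrs)]
    total = sum(kept)                      # zero only if no class matches the answer
    return [p / total for p in kept]

# Hypothetical: 3 classes, each defined by 3 binary attributes y_ic.
class_attrs = [(1, 0, 1), (1, 1, 0), (0, 1, 1)]
class_post = [0.6, 0.3, 0.1]

i = best_attribute(class_post, class_attrs)
updated = condition_on_answer(class_post, class_attrs, i, value=1)
```

Here the attribute marginals are 0.9, 0.4 and 0.7, so attribute 1 is asked first; a "true" answer rules out class 0 and leaves posteriors 0.75 and 0.25 for the other two classes.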


Results Classification

        Init   1      2      3      4      5      6      7      8
Indep   36.5   53.1   68.5   77.8   85.1   90.6   94.5   97.7   99.4
Mixt    38.7   55.3   72.3   84.8   92.4   96.9   99.0   99.8   100.0

Classification accuracy of the independent and mixture-of-trees models: initial results, and after user input for one up to eight selected attributes.


Conclusions

Tree-structured CRF models for interactive
• Image annotation, and
• Attribute-based classification

Improves moderately over independent models.
Real power in the interactive setting: (i) propagate user input, (ii) ask more informative questions.


Questions?!?
