Page 1: Computer Vision

Ivan Laptev
[email protected]
WILLOW, INRIA/ENS/CNRS, Paris

Computer Vision:
Weakly-supervised learning from video and images

CSClub, Saint Petersburg, November 17, 2014

Joint work with: Piotr Bojanowski, Rémi Lajugie, Maxime Oquab, Francis Bach, Léon Bottou, Jean Ponce, Cordelia Schmid, Josef Sivic

Page 2: Computer Vision

About the company

VisionLabs is a team of professionals with deep knowledge and substantial hands-on experience in developing computer vision algorithms and intelligent systems.

We create and deploy computer vision technologies, opening up new possibilities for changing the world around us for the better.

Contacts: official website: http://visionlabs.ru/; contact person: Alexander Khanin; e-mail: [email protected]; tel.: +7 (926) 988-7891

Page 3: Computer Vision

Team

Alexander Khanin, Chief Executive Officer
Alexey Nekhaev, Executive Officer
Slava Kazmin, Chief Technical Officer
Ivan Laptev, Scientific Advisor
Sergey Milyaev, Senior CV Engineer
Alexey Kordichev, Financial Advisor
Ivan Truskov, Software Developer
Sergey Cherepanov, Software Developer

Our team is a symbiosis of science and business.

Areas of activity:
Face recognition technology: a system for detecting fraudsters in banks
License plate recognition technology: a system for vehicle access control and automation
Safe-city technologies: a system for detecting violations and dangerous situations

Page 4: Computer Vision

Achievements

Projects of national scale

Page 5: Computer Vision

We are looking for like-minded people:

Creating and deploying intelligent systems
Solving interesting practical problems
Working in a friendly, ambitious team

Thank you for your attention!

Contacts: official website: http://visionlabs.ru/; contact person: Alexander Khanin; e-mail: [email protected]; tel.: +7 (926) 988-7891

Page 6: Computer Vision

What is Computer Vision?

Page 7: Computer Vision

What is Computer Vision?

Page 8: Computer Vision
Page 9: Computer Vision

What is the recent progress?

1990s (research): recognition at the level of a few toy objects (COIL-20 dataset)
1990s (industry): automated quality inspection (controlled lighting, scale, …)

Now: face recognition in social media; ImageNet: 14M images, 21K classes; 6% top-5 error rate in the 2014 challenge

Page 10: Computer Vision

Why image and video analysis? Data:

~5K image uploads every minute
>34K hours of video uploaded every day
TV channels recorded since the 1960s
~30M surveillance cameras in the US => ~700K video hours/day
~2.5 billion new images per month
… and even more with future wearable devices

Page 11: Computer Vision

Why looking at people?

How many person-pixels are in the video?

Movies, TV, YouTube

Page 12: Computer Vision

Why looking at people?

How many person-pixels are in the video?

Movies: 40%, TV: 35%, YouTube: 34%

Page 13: Computer Vision

How many person-pixels in our daily life?

Wearable camera data: Microsoft SenseCam dataset

Page 14: Computer Vision

How many person-pixels in our daily life?

Wearable camera data: Microsoft SenseCam dataset

~4%

Page 15: Computer Vision

What are the difficulties?

Large variations in appearance: occlusions, non-rigid motion, viewpoint changes, clothing, …

Manual collection of training samples is prohibitive: many action classes, rare occurrence

The action vocabulary is not well-defined, e.g. the actions "Open" and "Hugging" each cover visually very diverse instances.

Page 16: Computer Vision

This talk:

• Brief overview of recent techniques
• Weakly-supervised learning from video and scripts
• Weakly-supervised learning with convolutional neural networks

Page 17: Computer Vision

Standard visual recognition pipeline

Example classes: AnswerPhone, DriveCar, GetOutCar, HandShake, Kiss, StandUp

• Collect image/video samples and corresponding class labels
• Design an appropriate data representation with certain invariance properties
• Design, or use existing, machine learning methods for learning and classification

Page 18: Computer Vision

Bag-of-Features action recognition
[Laptev, Marszałek, Schmid, Rozenfeld 2008]

Pipeline: extraction of local features (space-time patches) → feature description → feature quantization (k-means clustering, k = 4000) → occurrence histogram of visual words → non-linear SVM with χ² kernel
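To make the pipeline concrete, here is a minimal sketch in Python with scikit-learn. The local feature extractor is left abstract (any space-time descriptor such as HOG/HOF would do), and `extract_descriptors`, `train_videos`, and `train_labels` are illustrative placeholders, not the authors' code.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import SVC
from sklearn.metrics.pairwise import chi2_kernel

def bof_histogram(descriptors, kmeans):
    """Quantize local descriptors into visual words and build a
    normalized occurrence histogram."""
    words = kmeans.predict(descriptors)
    hist = np.bincount(words, minlength=kmeans.n_clusters).astype(float)
    return hist / max(hist.sum(), 1.0)

# 1) Visual vocabulary: k-means over descriptors pooled from the
#    training videos (k = 4000 in the original pipeline).
train_descs = [extract_descriptors(v) for v in train_videos]
kmeans = KMeans(n_clusters=4000).fit(np.vstack(train_descs))

# 2) Each video becomes a histogram of visual words.
X_train = np.array([bof_histogram(d, kmeans) for d in train_descs])

# 3) Non-linear SVM with a chi-squared kernel.
svm = SVC(kernel=chi2_kernel).fit(X_train, train_labels)
```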

Page 19: Computer Vision

Action classification

Test episodes from the movies "The Graduate", "It's a Wonderful Life", and "Indiana Jones and the Last Crusade"

Page 20: Computer Vision

Where to get training data?

• Shoot actions in the lab: KTH dataset, Weizmann dataset, …
  - Limited variability
  - Unrealistic

• Manually annotate existing content: HMDB, Olympic Sports, UCF50, UCF101, …
  - Very time-consuming

• Use readily available video scripts: www.dailyscript.com, www.movie-page.com, www.weeklyscript.com
  - Scripts are available for thousands of hours of movies and TV series
  - Scripts describe both the dynamic and the static content of videos

Page 21: Computer Vision

As the headwaiter takes them to a table they pass by the piano, and the woman looks at Sam. Sam, with a conscious effort, keeps his eyes on the keyboard as they go past. The headwaiter seats Ilsa...

[Pages 22–24 show the same script excerpt with successive phrases highlighted.]

Page 25: Computer Vision

Script-based video annotation
[Laptev, Marszałek, Schmid, Rozenfeld 2008]

Subtitles (with time information):

1172
01:20:17,240 --> 01:20:20,437
Why weren't you honest with me?
Why'd you keep your marriage a secret?

1173
01:20:20,640 --> 01:20:23,598
It wasn't my secret, Richard.
Victor wanted it that way.

1174
01:20:23,800 --> 01:20:26,189
Not even our closest friends
knew about our marriage.

Movie script (no timestamps):

RICK: Why weren't you honest with me? Why did you keep your marriage a secret?

Rick sits down with Ilsa.

ILSA: Oh, it wasn't my secret, Richard. Victor wanted it that way. Not even our closest friends knew about our marriage.

Aligning the two texts assigns the script passage to the interval 01:20:17–01:20:23.

• Scripts are available for >500 movies (no time synchronization): www.dailyscript.com, www.movie-page.com, www.weeklyscript.com, …
• Subtitles (with time information) are available for most movies
• Time can be transferred to scripts by text alignment
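One simple way to transfer subtitle timestamps to the script is a monotonic dynamic-programming alignment over text similarity. The sketch below is an illustration under that assumption, not the authors' implementation; the subtitle tuple format and the `similarity` score are my choices.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Word-level similarity between two text fragments."""
    return SequenceMatcher(None, a.lower().split(), b.lower().split()).ratio()

def align(script_sents, subtitles):
    """Monotonic DP alignment of script sentences to timed subtitle
    blocks (start_sec, end_sec, text); the mapping never moves
    backwards in time."""
    n, m = len(script_sents), len(subtitles)
    score = [[0.0] * (m + 1) for _ in range(n + 1)]
    back = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            match = score[i - 1][j - 1] + similarity(script_sents[i - 1],
                                                     subtitles[j - 1][2])
            skip = score[i][j - 1]  # leave subtitle j unmatched
            score[i][j], back[i][j] = max((match, 1), (skip, 0))
    pairs, i, j = [], n, m
    while i > 0 and j > 0:          # trace back the best path
        if back[i][j] == 1:
            pairs.append((script_sents[i - 1], subtitles[j - 1][:2]))
            i, j = i - 1, j - 1
        else:
            j -= 1
    return list(reversed(pairs))
```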

Page 26: Computer Vision

Scripts as weak supervision

Example: an aligned script passage only brackets the action within the interval 24:25–24:51 (temporal uncertainty).

Challenges:
• Imprecise temporal localization
• No explicit spatial localization
• NLP problems; scripts ≠ training labels, e.g. "… Will gets out of the Chevrolet. …" and "… Erin exits her new truck …" both correspond to the label Get-out-car

Page 27: Computer Vision

Previous work

Sivic, Everingham, and Zisserman, "'Who are you?' – Learning Person Specific Classifiers from Video", CVPR 2009.

Buehler, Everingham, and Zisserman, "Learning sign language by watching TV (using weakly aligned subtitles)", CVPR 2009.

Duchenne, Laptev, Sivic, Bach, and Ponce, "Automatic Annotation of Human Actions in Video", ICCV 2009.

[Example of a weakly aligned subtitle: "…wanted to know about the history of the trees"]

Page 28: Computer Vision

Joint Learning of Actors and Actions
[Bojanowski et al. ICCV 2013]

Script: "Rick walks up behind Ilsa." The weak labels are ambiguous: which person track is Rick, and which track shows walking? Pages 28–29 illustrate the candidate labels and their resolved assignment.

Page 30: Computer Vision

Formulation: Cost function

[The slide relates actor labels (Rick, Ilsa, Sam, …), actor image features, and the actor classifier.]

Page 31: Computer Vision

Formulation: Cost function

Weak supervision from scripts: person p (e.g. p = Rick) appears at least once in clip N.

Page 32: Computer Vision

Formulation: Cost function

Weak supervision from scripts: action a (e.g. a = Walk) appears at least once in clip N.

Page 33: Computer Vision

Formulation: Cost function

Weak supervision from scripts: person p appears in clip N; action a appears in clip N; person p and action a appear together in clip N.
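The equations themselves are images on the original slides; the transcript drops them. As a hedged sketch of the general shape of such a discriminative-clustering objective (my reconstruction, not a verbatim copy of [Bojanowski et al. ICCV 2013]): latent assignments Z of names and actions to person tracks are optimized jointly with classifiers w, subject to the script constraints above.

```latex
% Sketch: discriminative clustering under weak script supervision.
% z_n assigns a person/action label to track n, x_n are its features,
% w parametrizes the classifiers, and \mathcal{Z} encodes the
% "appears at least once in clip N" constraints from the scripts.
\min_{Z \in \mathcal{Z},\; w} \;
  \frac{1}{N} \sum_{n=1}^{N} \ell\bigl(z_n,\, w^{\top} x_n\bigr)
  \;+\; \lambda \,\lVert w \rVert^{2}
```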

Page 34: Computer Vision

Image and video features

Face features: facial feature detection [Everingham'06]; HOG descriptor on the normalized face image.

Action features: dense trajectory features in the person bounding box [Wang et al. '11].
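For intuition, a face descriptor of this flavor can be computed with off-the-shelf tools. A minimal sketch using scikit-image; the crop size and HOG parameters are illustrative, not the talk's exact settings:

```python
from skimage.feature import hog
from skimage.transform import resize

def face_descriptor(face_crop):
    """HOG descriptor on a normalized (fixed-size, grayscale) face crop."""
    normalized = resize(face_crop, (64, 64))  # normalize scale
    return hog(normalized, orientations=9,
               pixels_per_cell=(8, 8), cells_per_block=(2, 2))
```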

Page 35: Computer Vision

Results for Person Labelling

American Beauty (11 character names), Casablanca (17 character names)

Page 36: Computer Vision

Results for Person + Action Labelling

Casablanca, Walking

Page 37: Computer Vision

Finding Actions and Actors in Movies

[Bojanowski, Bach, Laptev, Ponce, Sivic, Schmid, 2013]

Page 38: Computer Vision

Action Learning with Ordering Constraints
[Bojanowski et al. ECCV 2014]

Page 39: Computer Vision

Action Learning with Ordering Constraints
[Bojanowski et al. ECCV 2014]

Page 40: Computer Vision

Cost Function

Weak supervision from ordering constraints on Z: each clip is annotated with an ordered sequence of action labels (e.g. the action-index sequence 2, 4, 1, 2, 3, 2) that must be assigned, in order, to the video time intervals. A sketch of this constraint set follows below.
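To see what the ordering constraint means, the sketch below enumerates the feasible assignments of T consecutive time intervals to an ordered label sequence (each label covering at least one interval). This is my illustration of the constraint set, not the paper's solver:

```python
def feasible_assignments(order, T):
    """Yield lists z of length T with z[t] = label of interval t,
    where the labels follow `order` as consecutive runs."""
    K = len(order)
    if K > T:
        return
    def rec(idx, start):
        if idx == K - 1:                      # last label takes the rest
            yield [order[idx]] * (T - start)
            return
        # remaining labels each still need at least one interval
        for end in range(start + 1, T - (K - idx - 1) + 1):
            for rest in rec(idx + 1, end):
                yield [order[idx]] * (end - start) + rest
    yield from rec(0, 0)

# Example: order [2, 4, 1] over 5 intervals gives assignments such as
# [2, 2, 4, 1, 1] and [2, 4, 4, 4, 1].
for z in feasible_assignments([2, 4, 1], 5):
    print(z)
```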

[Pages 41–42 repeat the cost-function slide, stepping through the assignment of the ordered labels to the time intervals.]

Page 43: Computer Vision

Is the optimization tractable?

• Path constraints are implicit

• Cannot use off-the-shelf solvers

• Frank-Wolfe optimization algorithm
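For reference, Frank-Wolfe (conditional gradient) minimizes a smooth cost over a convex set by repeatedly solving a linear subproblem over that set, which suits feasible sets with implicit path constraints. A generic sketch on the probability simplex; this is an illustration, not the paper's exact solver:

```python
import numpy as np

def frank_wolfe(grad_f, linear_min_oracle, x0, iters=100):
    """x_{k+1} = x_k + gamma_k (s_k - x_k), where s_k minimizes
    <grad_f(x_k), s> over the feasible set."""
    x = x0
    for k in range(iters):
        s = linear_min_oracle(grad_f(x))
        gamma = 2.0 / (k + 2.0)       # standard step-size schedule
        x = x + gamma * (s - x)
    return x

# Example: minimize ||x - c||^2 over the probability simplex; the
# linear oracle returns the vertex with the smallest gradient entry.
c = np.array([0.7, 0.2, 0.4])
grad = lambda x: 2.0 * (x - c)
oracle = lambda g: np.eye(len(g))[np.argmin(g)]
print(frank_wolfe(grad, oracle, np.ones(3) / 3))  # -> approx. [0.6, 0.1, 0.3]
```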

Page 44: Computer Vision

Results

• 937 video clips from 60 Hollywood movies
• 16 action classes
• Each clip is annotated with a sequence of n actions (2 ≤ n ≤ 11)

Page 45: Computer Vision
Page 46: Computer Vision

Object recognition

Page 47: Computer Vision

Convolutional Neural Networks

• The ImageNet Large-Scale Visual Recognition Challenge is very hard: 1000 classes, 1.2M images
• The ILSVRC'12 results of Krizhevsky et al. improved on other methods by a large margin
• 2014: GoogLeNet reached 6% top-5 error

Page 48: Computer Vision

CNN of Krizhevsky et al. NIPS'12

• Learns low-level features at the first layer
• Has some tricks, but the main principle is similar to LeCun'89
• Has 60M parameters and 650K neurons
• Success seems to be determined by (a) lots of labeled images and (b) a very fast GPU implementation; neither was available until very recently

Page 49: Computer Vision

Approach

1. Design the training/test procedure using sliding windows
2. Train adaptation layers to map labels

See also [Girshick et al.'13], [Donahue et al.'13], [Sermanet et al.'14], [Zeiler and Fergus'13]; the transfer learning and ImageNet workshops at ICCV'13

Page 50: Computer Vision

Approach – sliding window training / testing

Page 51: Computer Vision

Results

Object localization

Page 52: Computer Vision

Results

[Oquab, Bottou, Laptev, Sivic 2013, HAL-00911179]

Page 53: Computer Vision

Results

Page 54: Computer Vision

Vision works?

Page 55: Computer Vision

Vision works?

[Oquab, Bottou, Laptev, Sivic 2013, HAL-00911179]

Page 56: Computer Vision

VOC Action Classification Taster Challenge

Given the bounding box of a person, predict whether they are performing a given action (playing an instrument? reading?).

Goal: encourage research on still-image activity recognition and a more detailed understanding of images.

Page 57: Computer Vision

Nine Action Classes

Phoning, Playing Instrument, Reading, Riding Bike, Riding Horse, Running, Taking Photo, Using Computer, Walking

Page 58: Computer Vision

CNN action recognition and localization. Qualitative results: reading

Page 59: Computer Vision

CNN action recognition and localization. Qualitative results: phoning

Page 60: Computer Vision

CNN action recognition and localization. Qualitative results: playing instrument

Page 61: Computer Vision

Results PASCAL VOC 2012

Object classification

Action classification

[Oquab, Bottou, Laptev, Sivic 2013, HAL-00911179]

Page 62: Computer Vision

Are bounding boxes needed for training CNNs?

Image-level labels: Bicycle, Person
[Oquab, Bottou, Laptev, Sivic, 2014]

Page 63: Computer Vision

Motivation: labeling bounding boxes is tedious

Page 64: Computer Vision

Motivation: image-level labels are plentiful

“Beautiful red leaves in a back street of Freiburg”

[Kuznetsova et al., ACL 2013]

http://www.cs.stonybrook.edu/~pkuznetsova/imgcaption/captions1K.html

Page 65: Computer Vision

Motivation: image-level labels are plentiful

“Public bikes in Warsaw during night”

https://www.flickr.com/photos/jacek_kadaj/8776008002/in/photostream/

Page 66: Computer Vision

Let the algorithm localize the object in the image
[Oquab, Bottou, Laptev, Sivic, 2014]

Example training images with bounding boxes; the locations of objects or their parts are learnt by the CNN.

NB: related to multiple-instance learning, e.g. [Viola et al.'05], and to weakly supervised object localization, e.g. [Pandey and Lazebnik'11], [Prest et al.'12], [Oh Song et al. ICML'14], …

Page 67: Computer Vision

Approach: search over the object's location

1. Efficient window sliding to find object location hypotheses
2. Image-level aggregation (max-pooling)
3. Multi-label loss function (allowing multiple objects per image)

See also [Sermanet et al. '14] and [Chatfield et al. '14]

[Architecture figure: convolutional layers C1–C5 and fully connected layers FC6–FC7 produce a 9216-dim, then 4096-dim feature vector per window; adaptation layers FCa–FCb map it to per-class scores (motorbike, person, diningtable, pottedplant, chair, car, bus, train, …); a max-pool over the image yields the per-image score.]
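Steps 2–3 are compact to state: treat the network output as an n × m × K score map over window positions, max-pool spatially, and sum K independent log-losses. A minimal numpy sketch with illustrative shapes:

```python
import numpy as np

def image_level_scores(score_map):
    """Global max-pool over an (n, m, K) per-window score map: searches
    over the best-scoring window for each class independently."""
    return score_map.max(axis=(0, 1))            # -> (K,)

def multilabel_log_loss(scores, labels):
    """labels: (K,) vector in {+1, -1} marking class presence/absence;
    sum of K independent binary log-losses."""
    return np.sum(np.log1p(np.exp(-labels * scores)))

# Toy example: 3 classes, only the first present in the image.
score_map = np.random.randn(5, 7, 3)
labels = np.array([+1, -1, -1])
loss = multilabel_log_loss(image_level_scores(score_map), labels)
```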

Page 68: Computer Vision

1. Efficient window sliding to find object location

[Figure 2: Network architecture. C1–C5 are convolutional feature-extraction layers trained on 1512 ImageNet classes (Oquab et al., 2014); FC6–FC7 and the adaptation layers FCa–FCb are trained on Pascal VOC. The legend indicates the number of maps in each layer, whether the layer performs cross-map normalization (norm), pooling (pool), or dropout, and its subsampling ratio with respect to the input image (from 192 maps, norm+pool, 1:8 at C1 to 20 maps, final-pool, 1:32 at FCb). See [21, 26] and Section 3 for full details.]

Initial work [1, 6, 7, 15, 37] on weakly supervised object localization focused on learning from images containing prominent, centered objects with limited background clutter. More recent efforts attempt to learn from images containing multiple objects embedded in complex scenes [2, 9, 28] or from video [30]. These methods typically localize objects with visually consistent appearance in training data that often contains multiple objects in different spatial configurations and cluttered backgrounds. While these works are promising, their performance is still far from that of fully supervised methods. Our work is also related to recent methods that find distinctive mid-level object parts for scene and object recognition in an unsupervised [34] or a weakly supervised [10, 20] setting.

In contrast to the above methods, we develop a weakly supervised learning method based on convolutional neural networks (CNNs) [22, 24]. Convolutional neural networks have recently demonstrated excellent performance on a number of visual recognition tasks, including classification of entire images [11, 21, 40], predicting the presence or absence of objects in cluttered scenes [4, 26, 31, 32], and localizing objects by bounding boxes [16, 32]. However, current CNN architectures assume in training a single prominent object in the image with limited background clutter [11, 21, 32, 40], or require fully annotated object locations in the image [16, 26]. Learning from images containing multiple objects in cluttered scenes with only weak presence/absence labels has so far been limited to representing entire images without explicitly searching for the locations of individual objects [4, 31, 40], though some robustness to object scale and position is gained by jittering.

In this work, we develop a weakly supervised convolutional neural network pipeline that learns from complex scenes containing multiple objects by explicitly searching over possible object locations and scales in the image. We demonstrate that our weakly supervised approach achieves the best published result on the Pascal VOC 2012 object classification dataset, outperforming methods trained on entire images [4, 31, 40] and performing on par with or better than fully supervised methods [26].

3 Network architecture for weakly supervised learning

We build on the fully supervised network architecture of [26], which consists of five convolutional and four fully connected layers and assumes as input a fixed-size image patch containing a single, relatively tightly cropped object. To adapt this architecture to weakly supervised learning we introduce three modifications. First, we treat the fully connected layers as convolutions, which allows us to deal with nearly arbitrary-sized images as input. Second, we explicitly search for the highest-scoring object position in the image by adding a single global max-pooling layer at the output. Third, we use a cost function that can explicitly model multiple objects present in the image. The three modifications are discussed next, and the network architecture is illustrated in Figure 2.

Page 69: Computer Vision

2. Image-level aggregation using global max-pool


Page 70: Computer Vision

3. Multi-label loss function (to allow for multiple objects in the image)


Sum of K (= 20) log-losses, one for each of the K classes, where f_w(x) is the K-vector of network outputs for image x and y is the K-vector of (+1, −1) labels indicating the presence/absence of each class:

ℓ(f_w(x), y) = Σ_{k=1}^{K} log(1 + exp(−y_k f_k(x)))

Page 71: Computer Vision

Search for objects using max-pooling

For each class score map (e.g. the aeroplane map, the car map), max-pooling selects the receptive field of the maximum-scoring neuron ("found something there!"). Training then adjusts that window's score: for a correct label it is increased ("keep up the good work!"), for an incorrect label it is decreased ("wrong!").

Page 72: Computer Vision

Search for objects using max-pooling

What is the effect of errors?

Page 73: Computer Vision

Multi-scale training and testing

[Figure 3: Weakly supervised training — the input image is rescaled by factors in [0.7…1.4] and per-class score maps (chair, diningtable, sofa, pottedplant, person, car, bus, train, …) are computed at each scale. Figure 4: Multiscale object recognition.]

Convolutional adaptation layers. The network architecture of [26] assumes a fixed-size image patch of 224×224 RGB pixels as input and outputs a 1×1×N vector of per-class scores, where N is the number of classes. The aim is to apply the network to bigger images in a sliding-window manner, thus extending its output to n×m×N, where n and m denote the number of sliding-window positions in the x- and y-directions of the image, respectively, computing the N per-class scores at all input window positions. While this type of sliding was performed in [26] by applying the network to independently extracted image patches, here we achieve the same effect by treating the fully connected adaptation layers as convolutions. For a given input image size, a fully connected layer can be seen as a special case of a convolution layer where the size of the kernel equals the size of the layer input. With this procedure the output of the final adaptation layer FC7 becomes a 2×2×N output score map for a 256×256 RGB input image, as shown in Figure 2. As the global stride of the network is 32 pixels, adding 32 pixels to the image width or height increases the width or height of the output score map by one. Hence, for example, a 2048×1024 pixel input would lead to a 58×26 output score map containing the scores of the network for all classes at all locations of the 224×224 input window with a stride of 32 pixels. While this architecture is typically used for efficient classification at test time, see e.g. [32], here we also use it at training time (as discussed in Section 4) to efficiently examine the entire image for possible locations of the object during weakly supervised training.

Explicit search for the object's position via max-pooling. The aim is to output a single image-level score for each object class independently of the input image size. This is achieved by aggregating the n×m×N matrix of output scores over the n×m positions of the input window, using a global max-pooling operation, into a single 1×1×N vector, where N is the number of classes. Note that the max-pooling operation effectively searches for the best-scoring candidate object position within the image, which is crucial for weakly supervised learning where the exact position of the object within the image is not given during training. In addition, due to the max-pooling operation the output of the network becomes independent of the size of the input image, which will be used for multi-scale learning in Section 4.

Multi-label classification cost function. The Pascal VOC classification task consists in telling whether at least one instance of a class is present in the image or not. We treat the task as a separate …
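The score-map arithmetic above is easy to verify with a small helper (illustrative):

```python
def score_map_size(height, width, window=224, stride=32):
    """Output score-map size for a convolutionalized network with a
    224-pixel input window and a global stride of 32 pixels."""
    return ((height - window) // stride + 1,
            (width - window) // stride + 1)

assert score_map_size(256, 256) == (2, 2)      # the 2x2xN map of Figure 2
assert score_map_size(1024, 2048) == (26, 58)  # the 58x26 map in the text
```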

Page 74: Computer Vision

Training videos

Page 75: Computer Vision

Test results on 80 classes in Microsoft COCO dataset

[Pages 76–80 show further qualitative test results on the same 80 Microsoft COCO classes.]