Ivan Laptev, [email protected], WILLOW, INRIA/ENS/CNRS, Paris
Computer Vision: Weakly-supervised learning from video and images
CS Club, Saint Petersburg, November 17, 2014
Joint work with: Piotr Bojanowski, Rémi Lajugie, Maxime Oquab, Francis Bach, Léon Bottou, Jean Ponce, Cordelia Schmid, Josef Sivic
VisionLabs is a team of professionals with extensive knowledge and substantial practical experience in developing computer vision algorithms and intelligent systems.
We create and deploy computer vision technologies, opening up new possibilities for changing the world around us for the better.
- Scripts are available for thousands of hours of movies and TV series
- Scripts describe dynamic and static content of videos
As the headwaiter takes them to a table they pass by the piano, and the woman looks at Sam. Sam, with a conscious effort, keeps his eyes on the keyboard as they go past. The headwaiter seats Ilsa...
…
1172
01:20:17,240 --> 01:20:20,437
Why weren't you honest with me?
Why'd you keep your marriage a secret?
1173
01:20:20,640 --> 01:20:23,598
It wasn't my secret, Richard.
Victor wanted it that way.
1174
01:20:23,800 --> 01:20:26,189
Not even our closest friends
knew about our marriage.
…
…
RICK
Why weren't you honest with me? Why
did you keep your marriage a secret?
Rick sits down with Ilsa.
ILSA
Oh, it wasn't my secret, Richard.
Victor wanted it that way. Not even
our closest friends knew about our
marriage.
…
01:20:17
01:20:23
subtitles ↔ movie script
• Scripts available for >500 movies (no time synchronization)
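Since scripts carry no timestamps, they must be aligned to the time-stamped subtitles by matching dialogue text. A minimal sketch of this idea, using text similarity to transfer subtitle timestamps to script lines (the data, the `align` helper, and the threshold are illustrative, not the authors' implementation):

```python
from difflib import SequenceMatcher

# Hypothetical data in the spirit of the example above: subtitles carry
# timestamps, script lines carry speaker identities but no timing.
subtitles = [
    ("01:20:17,240", "01:20:20,437",
     "Why weren't you honest with me? Why'd you keep your marriage a secret?"),
    ("01:20:20,640", "01:20:23,598",
     "It wasn't my secret, Richard. Victor wanted it that way."),
]
script_lines = [
    ("RICK", "Why weren't you honest with me? Why did you keep your marriage a secret?"),
    ("ILSA", "Oh, it wasn't my secret, Richard. Victor wanted it that way."),
]

def align(subtitles, script_lines, threshold=0.5):
    """Transfer subtitle timestamps to script dialogue by text similarity."""
    aligned = []
    for start, end, sub_text in subtitles:
        # Find the script line whose dialogue best matches this subtitle.
        speaker, line = max(
            script_lines,
            key=lambda sl: SequenceMatcher(None, sub_text.lower(), sl[1].lower()).ratio(),
        )
        score = SequenceMatcher(None, sub_text.lower(), line.lower()).ratio()
        if score >= threshold:
            aligned.append((start, end, speaker, line))
    return aligned

for start, end, speaker, line in align(subtitles, script_lines):
    print(start, end, speaker, line)
```

With the timestamps transferred, each script line (with its speaker identity and scene description) becomes a weak, time-localized video annotation.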
Let the algorithm localize the object in the image
[Oquab, Bottou, Laptev, Sivic, 2014]
Example training images with bounding boxes
The locations of objects or their parts learnt by the CNN
NB: Related to multiple instance learning, e.g. [Viola et al.’05] and weakly supervised object
localization, e.g. [Pandey and Lazebnik’11], [Prest et al.’12], [Oh Song et al. ICML’14], …
Approach: search over object’s location
1. Efficient window sliding to find object location hypothesis
2. Image-level aggregation (max-pool)
3. Multi-label loss function (allow multiple objects in image)
See also [Sermanet et al.’14] and [Chatfield et al.’14]
[Figure: the weakly supervised pipeline. Layers C1–C5, FC6, FC7 (a 9216-dim vector into FC6, 4096-dim vectors out of FC6 and FC7), followed by adaptation layers FCa, FCb, produce per-class score maps (motorbike, person, diningtable, pottedplant, chair, car, bus, train, …); a global max-pool over the image yields the per-image score.]
1. Efficient window sliding to find object location
Figure 2: Network architecture (layers C1–C5, FC6, FC7, FCa, FCb). The layer legend indicates the number of maps, whether the layer performs cross-map normalization (norm), pooling (pool), dropouts (dropout), and reports its subsampling ratio with respect to the input image. Convolutional feature extraction layers trained on 1512 ImageNet classes (Oquab et al., 2014); adaptation layers trained on Pascal VOC. See [21, 26] and Section 3 for full details.
Initial work [1, 6, 7, 15, 37] on weakly supervised object localization has focused on learning from images containing prominent and centered objects with limited background clutter. More recent efforts attempt to learn from images containing multiple objects embedded in complex scenes [2, 9, 28] or from video [30]. These methods typically localize objects with visually consistent appearance in training data that often contains multiple objects in different spatial configurations and cluttered backgrounds. While these works are promising, their performance is still far from that of fully supervised methods. Our work is also related to recent methods that find distinctive mid-level object parts for scene and object recognition in an unsupervised [34] or a weakly supervised [10, 20] setting.
In contrast to the above methods, we develop a weakly supervised learning method based on convolutional neural networks (CNNs) [22, 24]. Convolutional neural networks have recently demonstrated excellent performance on a number of visual recognition tasks that include classification of entire images [11, 21, 40], predicting presence/absence of objects in cluttered scenes [4, 26, 31, 32], or localizing objects by bounding boxes [16, 32]. However, the current CNN architectures assume in training a single prominent object in the image with limited background clutter [11, 21, 32, 40] or require fully annotated object locations in the image [16, 26]. Learning from images containing multiple objects in cluttered scenes with only weak object presence/absence labels has so far been limited to representing entire images without explicitly searching for the location of individual objects [4, 31, 40], though some level of robustness to the scale and position of objects is gained by jittering.
In this work, we develop a weakly supervised convolutional neural network pipeline that learns from complex scenes containing multiple objects by explicitly searching over possible object locations and scales in the image. We demonstrate that our weakly supervised approach achieves the best published result on the Pascal VOC 2012 object classification dataset, outperforming methods trained on entire images [4, 31, 40] as well as performing on par with or better than fully supervised methods [26].
3 Network architecture for weakly supervised learning
We build on the fully supervised network architecture of [26], which consists of five convolutional and four fully connected layers and assumes as input a fixed-size image patch containing a single relatively tightly cropped object. To adapt this architecture to weakly supervised learning we introduce the following three modifications. First, we treat the fully connected layers as convolutions, which allows us to deal with nearly arbitrarily sized images as input. Second, we explicitly search for the highest-scoring object position in the image by adding a single global max-pooling layer at the output. Third, we use a cost function that can explicitly model multiple objects present in the image. The three modifications are discussed next and the network architecture is illustrated in Figure 2.
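The first two modifications — fully connected layers treated as convolutions, plus a global max-pool — can be sketched in miniature. This is a toy numpy sketch, not the authors' network: a single "FC layer" scoring a small window stands in for the full CNN, with assumed toy sizes (window 7, stride 1, 3 classes) in place of the real 224-pixel window and 32-pixel stride.

```python
import numpy as np

rng = np.random.default_rng(0)

WINDOW = 7    # stands in for the 224-pixel input window
STRIDE = 1    # stands in for the global stride of 32 pixels
K = 3         # number of classes

# "FC layer as convolution": a fully connected layer over a WINDOW x WINDOW
# patch is a convolution whose kernel size equals the patch size.
fc_weights = rng.standard_normal((K, WINDOW, WINDOW))

def score_map(image):
    """Apply the window scorer at every valid position (valid convolution)."""
    h, w = image.shape
    n = (h - WINDOW) // STRIDE + 1
    m = (w - WINDOW) // STRIDE + 1
    out = np.empty((n, m, K))
    for i in range(n):
        for j in range(m):
            patch = image[i*STRIDE:i*STRIDE+WINDOW, j*STRIDE:j*STRIDE+WINDOW]
            out[i, j] = np.tensordot(fc_weights, patch, axes=([1, 2], [0, 1]))
    return out

def image_score(image):
    """Global max-pool: one score per class, independent of image size."""
    return score_map(image).max(axis=(0, 1))

img = rng.standard_normal((12, 10))
print(score_map(img).shape)    # (6, 4, 3): n x m x K score map
print(image_score(img).shape)  # (3,): one score per class
```

The key property this illustrates: the score map grows with the input image, but after the global max-pool the output dimension depends only on the number of classes.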
…
2. Image-level aggregation using global max-pool
…
3. Multi-label loss function
(to allow for multiple objects in image)
Sum of K (= 20) log-loss functions, one for each of the K classes:

\ell(f(x), y) = \sum_{k=1}^{K} \log\bigl(1 + e^{-y_k f_k(x)}\bigr)

where f(x) is the K-vector of network outputs for image x and y is the K-vector of (+1, −1) labels indicating the presence/absence of each class.
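A minimal numpy sketch of this sum of per-class log-losses (the function name and example values are illustrative):

```python
import numpy as np

def multilabel_log_loss(scores, labels):
    """Sum of K independent one-vs-all log-losses.

    scores: K-vector f(x) of per-class network outputs (after max-pool).
    labels: K-vector of +1/-1 class presence/absence indicators.
    """
    # log1p(exp(.)) is the numerically safer form of log(1 + e^(-y_k f_k(x)))
    return np.sum(np.log1p(np.exp(-labels * scores)))

scores = np.array([2.0, -1.5, 0.0])    # e.g. person present, car absent, ...
labels = np.array([+1.0, -1.0, +1.0])  # ground-truth presence/absence
loss = multilabel_log_loss(scores, labels)
```

Because each class contributes its own independent term, the loss can penalize or reward several classes at once, which is what allows multiple objects per image.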
[Figure: max-pooling search over per-class score maps (e.g. aeroplane map, car map). Training feedback flows through the receptive field of the maximum-scoring window: "Found something there!" — a correct detection gets "Keep up the good work!" (increase score), an incorrect one gets "Wrong!" (decrease score).]
Convolutional adaptation layers. The network architecture of [26] assumes a fixed-size image patch of 224×224 RGB pixels as input and outputs a 1×1×N vector of per-class scores, where N is the number of classes. The aim is to apply the network to bigger images in a sliding window manner, thus extending its output to n×m×N, where n and m denote the number of sliding window positions in the x- and y-direction in the image, respectively, computing the N per-class scores at all input window positions. While this type of sliding was performed in [26] by applying the network to independently extracted image patches, here we achieve the same effect by treating the fully connected adaptation layers as convolutions. For a given input image size, the fully connected layer can be seen as a special case of a convolution layer where the size of the kernel is equal to the size of the layer input. With this procedure the output of the final adaptation layer FC7 becomes a 2×2×N output score map for a 256×256 RGB input image, as shown in Figure 2. As the global stride of the network is 32 pixels, adding 32 pixels to the image width or height increases the width or height of the output score map by one. Hence, for example, a 2048×1024 pixel input would lead to a 58×26 output score map containing the score of the network for all classes for the different locations of the input 224×224 window with a stride of 32 pixels. While this architecture is typically used for efficient classification at test time, see e.g. [32], here we also use it at training time (as discussed in Section 4) to efficiently examine the entire image for possible locations of the object during weakly supervised training.
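The score-map size arithmetic in this paragraph can be checked directly. A small sketch assuming the stated 224-pixel input window and 32-pixel global stride (the helper function is mine, not from the paper):

```python
WINDOW, STRIDE = 224, 32  # input window and global network stride, per the text

def score_map_size(width, height):
    """Number of sliding-window positions along each image dimension."""
    return ((width - WINDOW) // STRIDE + 1, (height - WINDOW) // STRIDE + 1)

print(score_map_size(256, 256))    # (2, 2): the 2x2xN map in Figure 2
print(score_map_size(2048, 1024))  # (58, 26): the example in the text
```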
Explicit search for object’s position via max-pooling. The aim is to output a single image-level score for each of the object classes independently of the input image size. This is achieved by aggregating the n×m×N matrix of output scores for the n×m different positions of the input window using a global max-pooling operation into a single 1×1×N vector, where N is the number of classes. Note that the max-pooling operation effectively searches for the best-scoring candidate object position within the image, which is crucial for weakly supervised learning where the exact position of the object within the image is not given at training time. In addition, due to the max-pooling operation the output of the network becomes independent of the size of the input image, which will be used for multi-scale learning in Section 4.
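Because the max-pool takes the best-scoring window, its arg-max also yields a localization for free. A sketch of that idea, assuming the 224-pixel window and 32-pixel stride described above (the function and the toy score map are illustrative):

```python
import numpy as np

WINDOW, STRIDE = 224, 32

def max_pool_with_location(score_map):
    """score_map: n x m array of one class's window scores.

    Returns the max score and the corresponding window (x, y, w, h)
    in input-image pixel coordinates.
    """
    i, j = np.unravel_index(np.argmax(score_map), score_map.shape)
    box = (j * STRIDE, i * STRIDE, WINDOW, WINDOW)  # x, y, width, height
    return score_map[i, j], box

score_map = np.zeros((26, 58))  # e.g. from a 2048 x 1024 input
score_map[4, 10] = 3.7          # pretend the best window is at row 4, col 10
score, box = max_pool_with_location(score_map)
print(score, box)               # 3.7 (320, 128, 224, 224)
```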
Multi-label classification cost function. The Pascal VOC classification task consists in telling whether at least one instance of a class is present in the image. We treat the task as a separate…
Training videos
Test results on 80 classes in Microsoft COCO dataset