Ivan Laptev, [email protected], WILLOW, INRIA/ENS/CNRS, Paris
Computer Vision: Weakly-supervised learning from video and images
CS Club, Saint Petersburg, November 17, 2014
Joint work with: Piotr Bojanowski, Rémi Lajugie, Maxime Oquab, Francis Bach, Léon Bottou, Jean Ponce, Cordelia Schmid, Josef Sivic
VisionLabs is a team of professionals with extensive knowledge and substantial practical experience in developing computer vision algorithms and intelligent systems.
We create and deploy computer vision technologies, opening up new possibilities for changing the world around us for the better.
- Scripts are available for thousands of hours of movies and TV series
- Scripts describe dynamic and static content of videos
As the headwaiter takes them to a table they pass by the piano, and the woman looks at Sam. Sam, with a conscious effort, keeps his eyes on the keyboard as they go past. The headwaiter seats Ilsa...
…
1172
01:20:17,240 --> 01:20:20,437
Why weren't you honest with me?
Why'd you keep your marriage a secret?
1173
01:20:20,640 --> 01:20:23,598
It wasn't my secret, Richard.
Victor wanted it that way.
1174
01:20:23,800 --> 01:20:26,189
Not even our closest friends
knew about our marriage.
…
…
RICK
Why weren't you honest with me? Why
did you keep your marriage a secret?
Rick sits down with Ilsa.
ILSA
Oh, it wasn't my secret, Richard.
Victor wanted it that way. Not even
our closest friends knew about our
marriage.
…
01:20:17
01:20:23
subtitles ↔ movie script
• Scripts available for >500 movies (no time synchronization)
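Since scripts carry no timestamps, they must be aligned to the time-stamped subtitles by matching dialogue text. A minimal sketch of this idea, using text similarity to transfer subtitle timestamps to script lines (the data, the `align` helper, and the threshold are illustrative, not the authors' implementation):

```python
from difflib import SequenceMatcher

# Hypothetical data in the spirit of the example above: subtitles carry
# timestamps, script lines carry speaker identities but no timing.
subtitles = [
    ("01:20:17,240", "01:20:20,437",
     "Why weren't you honest with me? Why'd you keep your marriage a secret?"),
    ("01:20:20,640", "01:20:23,598",
     "It wasn't my secret, Richard. Victor wanted it that way."),
]
script_lines = [
    ("RICK", "Why weren't you honest with me? Why did you keep your marriage a secret?"),
    ("ILSA", "Oh, it wasn't my secret, Richard. Victor wanted it that way."),
]

def align(subtitles, script_lines, threshold=0.5):
    """Transfer subtitle timestamps to script dialogue by text similarity."""
    aligned = []
    for start, end, sub_text in subtitles:
        # Find the script line whose dialogue best matches this subtitle.
        speaker, line = max(
            script_lines,
            key=lambda sl: SequenceMatcher(None, sub_text.lower(), sl[1].lower()).ratio(),
        )
        score = SequenceMatcher(None, sub_text.lower(), line.lower()).ratio()
        if score >= threshold:
            aligned.append((start, end, speaker, line))
    return aligned

for start, end, speaker, line in align(subtitles, script_lines):
    print(start, end, speaker, line)
```

With the timestamps transferred, each script line (with its speaker identity and scene description) becomes a weak, time-localized video annotation.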
Let the algorithm localize the object in the image
[Oquab, Bottou, Laptev, Sivic, 2014]
Example training images with bounding boxes
The locations of objects or their parts learnt by the CNN
NB: Related to multiple instance learning, e.g. [Viola et al.’05] and weakly supervised object
localization, e.g. [Pandey and Lazebnik’11], [Prest et al.’12], [Oh Song et al. ICML’14], …
Approach: search over object’s location
1. Efficient window sliding to find object location hypothesis
2. Image-level aggregation (max-pool)
3. Multi-label loss function (allow multiple objects in image)
See also [Sermanet et al.’14] and [Chatfield et al.’14]
[Figure: the weakly supervised pipeline. Layers C1–C5, FC6, FC7 (a 9216-dim vector into FC6, 4096-dim vectors out of FC6 and FC7), followed by adaptation layers FCa, FCb, produce per-class score maps (motorbike, person, diningtable, pottedplant, chair, car, bus, train, …); a global max-pool over the image yields the per-image score.]
1. Efficient window sliding to find object location
Figure 2: Network architecture (layers C1–C5, FC6, FC7, FCa, FCb). The layer legend indicates the number of maps, whether the layer performs cross-map normalization (norm), pooling (pool), dropouts (dropout), and reports its subsampling ratio with respect to the input image. Convolutional feature extraction layers trained on 1512 ImageNet classes (Oquab et al., 2014); adaptation layers trained on Pascal VOC. See [21, 26] and Section 3 for full details.
Initial work [1, 6, 7, 15, 37] on weakly supervised object localization has focused on learning from images containing prominent and centered objects with limited background clutter. More recent efforts attempt to learn from images containing multiple objects embedded in complex scenes [2, 9, 28] or from video [30]. These methods typically localize objects with visually consistent appearance in training data that often contains multiple objects in different spatial configurations and cluttered backgrounds. While these works are promising, their performance is still far from that of fully supervised methods. Our work is also related to recent methods that find distinctive mid-level object parts for scene and object recognition in an unsupervised [34] or a weakly supervised [10, 20] setting.
In contrast to the above methods, we develop a weakly supervised learning method based on convolutional neural networks (CNNs) [22, 24]. Convolutional neural networks have recently demonstrated excellent performance on a number of visual recognition tasks that include classification of entire images [11, 21, 40], predicting presence/absence of objects in cluttered scenes [4, 26, 31, 32], or localizing objects by bounding boxes [16, 32]. However, the current CNN architectures assume in training a single prominent object in the image with limited background clutter [11, 21, 32, 40] or require fully annotated object locations in the image [16, 26]. Learning from images containing multiple objects in cluttered scenes with only weak object presence/absence labels has so far been limited to representing entire images without explicitly searching for the location of individual objects [4, 31, 40], though some level of robustness to the scale and position of objects is gained by jittering.
In this work, we develop a weakly supervised convolutional neural network pipeline that learns from complex scenes containing multiple objects by explicitly searching over possible object locations and scales in the image. We demonstrate that our weakly supervised approach achieves the best published result on the Pascal VOC 2012 object classification dataset, outperforming methods trained on entire images [4, 31, 40] as well as performing on par with or better than fully supervised methods [26].
3 Network architecture for weakly supervised learning
We build on the fully supervised network architecture of [26], which consists of five convolutional and four fully connected layers and assumes as input a fixed-size image patch containing a single relatively tightly cropped object. To adapt this architecture to weakly supervised learning we introduce the following three modifications. First, we treat the fully connected layers as convolutions, which allows us to deal with nearly arbitrarily sized images as input. Second, we explicitly search for the highest-scoring object position in the image by adding a single global max-pooling layer at the output. Third, we use a cost function that can explicitly model multiple objects present in the image. The three modifications are discussed next and the network architecture is illustrated in Figure 2.
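The first two modifications — fully connected layers treated as convolutions, plus a global max-pool — can be sketched in miniature. This is a toy numpy sketch, not the authors' network: a single "FC layer" scoring a small window stands in for the full CNN, with assumed toy sizes (window 7, stride 1, 3 classes) in place of the real 224-pixel window and 32-pixel stride.

```python
import numpy as np

rng = np.random.default_rng(0)

WINDOW = 7    # stands in for the 224-pixel input window
STRIDE = 1    # stands in for the global stride of 32 pixels
K = 3         # number of classes

# "FC layer as convolution": a fully connected layer over a WINDOW x WINDOW
# patch is a convolution whose kernel size equals the patch size.
fc_weights = rng.standard_normal((K, WINDOW, WINDOW))

def score_map(image):
    """Apply the window scorer at every valid position (valid convolution)."""
    h, w = image.shape
    n = (h - WINDOW) // STRIDE + 1
    m = (w - WINDOW) // STRIDE + 1
    out = np.empty((n, m, K))
    for i in range(n):
        for j in range(m):
            patch = image[i*STRIDE:i*STRIDE+WINDOW, j*STRIDE:j*STRIDE+WINDOW]
            out[i, j] = np.tensordot(fc_weights, patch, axes=([1, 2], [0, 1]))
    return out

def image_score(image):
    """Global max-pool: one score per class, independent of image size."""
    return score_map(image).max(axis=(0, 1))

img = rng.standard_normal((12, 10))
print(score_map(img).shape)    # (6, 4, 3): n x m x K score map
print(image_score(img).shape)  # (3,): one score per class
```

The key property this illustrates: the score map grows with the input image, but after the global max-pool the output dimension depends only on the number of classes.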
…
2. Image-level aggregation using global max-pool
…
3. Multi-label loss function
(to allow for multiple objects in image)
Sum of K (= 20) log-loss functions, one for each of the K classes:

\ell(f(x), y) = \sum_{k=1}^{K} \log\bigl(1 + e^{-y_k f_k(x)}\bigr)

where f(x) is the K-vector of network outputs for image x and y is the K-vector of (+1, −1) labels indicating the presence/absence of each class.
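A minimal numpy sketch of this sum of per-class log-losses (the function name and example values are illustrative):

```python
import numpy as np

def multilabel_log_loss(scores, labels):
    """Sum of K independent one-vs-all log-losses.

    scores: K-vector f(x) of per-class network outputs (after max-pool).
    labels: K-vector of +1/-1 class presence/absence indicators.
    """
    # log1p(exp(.)) is the numerically safer form of log(1 + e^(-y_k f_k(x)))
    return np.sum(np.log1p(np.exp(-labels * scores)))

scores = np.array([2.0, -1.5, 0.0])    # e.g. person present, car absent, ...
labels = np.array([+1.0, -1.0, +1.0])  # ground-truth presence/absence
loss = multilabel_log_loss(scores, labels)
```

Because each class contributes its own independent term, the loss can penalize or reward several classes at once, which is what allows multiple objects per image.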
[Figure: max-pooling search over per-class score maps (e.g. aeroplane map, car map). Training feedback flows through the receptive field of the maximum-scoring window: "Found something there!" — a correct detection gets "Keep up the good work!" (increase score), an incorrect one gets "Wrong!" (decrease score).]
Convolutional adaptation layers. The network architecture of [26] assumes a fixed-size image patch of 224×224 RGB pixels as input and outputs a 1×1×N vector of per-class scores, where N is the number of classes. The aim is to apply the network to bigger images in a sliding window manner, thus extending its output to n×m×N, where n and m denote the number of sliding window positions in the x- and y-direction in the image, respectively, computing the N per-class scores at all input window positions. While this type of sliding was performed in [26] by applying the network to independently extracted image patches, here we achieve the same effect by treating the fully connected adaptation layers as convolutions. For a given input image size, the fully connected layer can be seen as a special case of a convolution layer where the size of the kernel is equal to the size of the layer input. With this procedure the output of the final adaptation layer FC7 becomes a 2×2×N output score map for a 256×256 RGB input image, as shown in Figure 2. As the global stride of the network is 32 pixels, adding 32 pixels to the image width or height increases the width or height of the output score map by one. Hence, for example, a 2048×1024 pixel input would lead to a 58×26 output score map containing the score of the network for all classes for the different locations of the input 224×224 window with a stride of 32 pixels. While this architecture is typically used for efficient classification at test time, see e.g. [32], here we also use it at training time (as discussed in Section 4) to efficiently examine the entire image for possible locations of the object during weakly supervised training.
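The score-map size arithmetic in this paragraph can be checked directly. A small sketch assuming the stated 224-pixel input window and 32-pixel global stride (the helper function is mine, not from the paper):

```python
WINDOW, STRIDE = 224, 32  # input window and global network stride, per the text

def score_map_size(width, height):
    """Number of sliding-window positions along each image dimension."""
    return ((width - WINDOW) // STRIDE + 1, (height - WINDOW) // STRIDE + 1)

print(score_map_size(256, 256))    # (2, 2): the 2x2xN map in Figure 2
print(score_map_size(2048, 1024))  # (58, 26): the example in the text
```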
Explicit search for object’s position via max-pooling. The aim is to output a single image-level score for each of the object classes independently of the input image size. This is achieved by aggregating the n×m×N matrix of output scores for the n×m different positions of the input window using a global max-pooling operation into a single 1×1×N vector, where N is the number of classes. Note that the max-pooling operation effectively searches for the best-scoring candidate object position within the image, which is crucial for weakly supervised learning where the exact position of the object within the image is not given at training time. In addition, due to the max-pooling operation the output of the network becomes independent of the size of the input image, which will be used for multi-scale learning in Section 4.
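Because the max-pool takes the best-scoring window, its arg-max also yields a localization for free. A sketch of that idea, assuming the 224-pixel window and 32-pixel stride described above (the function and the toy score map are illustrative):

```python
import numpy as np

WINDOW, STRIDE = 224, 32

def max_pool_with_location(score_map):
    """score_map: n x m array of one class's window scores.

    Returns the max score and the corresponding window (x, y, w, h)
    in input-image pixel coordinates.
    """
    i, j = np.unravel_index(np.argmax(score_map), score_map.shape)
    box = (j * STRIDE, i * STRIDE, WINDOW, WINDOW)  # x, y, width, height
    return score_map[i, j], box

score_map = np.zeros((26, 58))  # e.g. from a 2048 x 1024 input
score_map[4, 10] = 3.7          # pretend the best window is at row 4, col 10
score, box = max_pool_with_location(score_map)
print(score, box)               # 3.7 (320, 128, 224, 224)
```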
Multi-label classification cost function. The Pascal VOC classification task consists in telling whether at least one instance of a class is present in the image. We treat the task as a separate…
Training videos
Test results on 80 classes in Microsoft COCO dataset