w;h - florian-strub.comflorian-strub.com/modulating-early-visual-poster.pdf · Modulatingearlyvisualprocessingbylanguage Harm de Vries1, Florian Strub2, Jeremie Mary2, Hugo Larochelle

Modulating early visual processing by language

Harm de Vries*1, Florian Strub*2, Jeremie Mary2, Hugo Larochelle1,3, Olivier Pietquin1,5, Aaron Courville1,4

1MILA, Université de Montréal - 2Univ. Lille, CNRS, Centrale Lille, Inria, UMR 9189 CRIStAL - 3Google Brain - 4CIFAR - 5DeepMind

Rethinking language-vision tasks

Premise: In language-vision tasks (VQA, image captioning, instruction following), the classic pipeline processes the visual and linguistic inputs independently before fusing them into a single representation. This joint embedding is then used to solve the task at hand.

Claim: Linguistic input should modulate the visual processing from the very beginning to fuse both modalities more effectively and to obtain a better joint embedding.

Solution: We introduce Conditional Batch Normalization as a modulation mechanism to alter the activations of a pre-trained ResNet conditioned on a language embedding.

Results: We show strong improvements on the VQA and GuessWhat?! datasets and find that early visual modulation is beneficial.

(Left) Classic language-vision task pipeline. (Right) Our proposed approach.

Conditional BatchNorm

The idea is to condition the affine scaling parameters of a Batch Normalization (BN) layer, γ and β, on an external input e_q. When applied to a pre-trained convnet, we predict changes Δβ_c and Δγ_c relative to the pre-trained BN parameters.

We write F_{i,c,w,h} for the activation at location (w, h) of the c-th feature map of the i-th input sample. Given a mini-batch B = {F_{i,·,·,·}}_{i=1}^{N} of N examples, Conditional Batch Normalization (CBN) normalizes the feature maps at training time as follows:

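$$\Delta\beta = \mathrm{MLP}(e_q), \qquad \Delta\gamma = \mathrm{MLP}(e_q)$$

$$\mathrm{CBN}(F_{i,c,w,h}) = (\gamma_c + \Delta\gamma_c)\,\frac{F_{i,c,w,h} - \mathbb{E}_{B}[F_{\cdot,c,\cdot,\cdot}]}{\sqrt{\mathrm{Var}_{B}[F_{\cdot,c,\cdot,\cdot}] + \epsilon}} + (\beta_c + \Delta\beta_c).$$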

CBN is a powerful method to modulate neural activations, as it enables an external embedding to manipulate entire feature maps:

• by scaling them up or down if γ_c > 0

• by shifting them if β_c ≠ 0

• by shutting them off if γ_c = 0
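The normalization and the three effects listed above can be illustrated with a minimal NumPy sketch. It assumes an (N, C, H, W) tensor layout, per-sample deltas of shape (N, C), and eps = 1e-5; the function name and all variable names are hypothetical and not taken from the authors' code. Batch statistics are computed as at training time.

```python
import numpy as np

def conditional_batch_norm(F, gamma, beta, delta_gamma, delta_beta, eps=1e-5):
    """Sketch of CBN at training time (layout and shapes are assumptions).

    F:            feature maps, shape (N, C, H, W)
    gamma, beta:  pre-trained BN parameters, shape (C,)
    delta_gamma,
    delta_beta:   predicted per-sample changes, shape (N, C)
    """
    # Per-channel statistics over the batch and both spatial dimensions.
    mean = F.mean(axis=(0, 2, 3), keepdims=True)
    var = F.var(axis=(0, 2, 3), keepdims=True)
    F_hat = (F - mean) / np.sqrt(var + eps)
    # Modulated affine parameters, broadcast over the spatial dimensions.
    scale = (gamma[None, :] + delta_gamma)[:, :, None, None]
    shift = (beta[None, :] + delta_beta)[:, :, None, None]
    return scale * F_hat + shift

rng = np.random.default_rng(0)
F = rng.normal(size=(4, 3, 5, 5))
gamma, beta = np.ones(3), np.zeros(3)
zeros = np.zeros((4, 3))

# Zero deltas: CBN reduces to ordinary batch normalization.
no_change = conditional_batch_norm(F, gamma, beta, zeros, zeros)

# delta_gamma = -gamma on channel 0 shuts that feature map off.
delta_gamma = zeros.copy()
delta_gamma[:, 0] = -gamma[0]
shut_off = conditional_batch_norm(F, gamma, beta, delta_gamma, zeros)
```

With zero deltas the output is that of the unmodified BN layer, so the pre-trained network's behaviour is the starting point of the modulation.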

Modulating ResNet

In order to modulate the visual pipeline, we condition the BN parameters of a pre-trained ResNet on a language embedding obtained from a recurrent network. We train end-to-end, but we stress that we freeze all ResNet parameters, including γ and β, during training.
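As a rough sketch of how the changes to γ and β might be predicted from the language embedding, the snippet below uses a one-hidden-layer MLP in NumPy. Several choices are assumptions made for illustration only: a single MLP with a split output instead of two separate MLPs, a ReLU hidden layer, and a zero-initialized output layer so that training starts from the unmodified pre-trained network. The names `init_delta_mlp` and `predict_deltas` are hypothetical.

```python
import numpy as np

def init_delta_mlp(embed_dim, hidden_dim, num_channels, rng):
    """Parameters of a small MLP mapping e_q to (delta_gamma, delta_beta)."""
    return {
        "W1": rng.normal(scale=0.1, size=(embed_dim, hidden_dim)),
        "b1": np.zeros(hidden_dim),
        # Assumption: zero-initialized output layer, so the deltas start at zero.
        "W2": np.zeros((hidden_dim, 2 * num_channels)),
        "b2": np.zeros(2 * num_channels),
    }

def predict_deltas(params, e_q):
    """e_q: language embeddings, shape (N, embed_dim). Returns two (N, C) arrays."""
    hidden = np.maximum(0.0, e_q @ params["W1"] + params["b1"])  # ReLU
    out = hidden @ params["W2"] + params["b2"]
    delta_gamma, delta_beta = np.split(out, 2, axis=-1)
    return delta_gamma, delta_beta

rng = np.random.default_rng(0)
params = init_delta_mlp(embed_dim=16, hidden_dim=32, num_channels=8, rng=rng)
e_q = rng.normal(size=(4, 16))  # stand-in for the recurrent network's output
delta_gamma, delta_beta = predict_deltas(params, e_q)
```

Only the MLP parameters (and the language model) would receive gradients in this setup; the pre-trained γ and β stay fixed and are merely offset by the predicted deltas.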

We apply CBN to a pre-trained ResNet-50, leading to the MODulatEd Residual Network (MODERN). To verify that the gains from MODERN do not come from increased model capacity, we include two baselines with more capacity:

• Ft Stage 4: fine-tuning the layers of stage 4 of ResNet-50

• Ft BN: fine-tuning all β and γ parameters of ResNet-50, while freezing all its weights.
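The difference between MODERN and the two baselines is which visual-pipeline parameters get updated. The sketch below makes this explicit for a toy list of parameter names. The naming scheme (`resnet.stageK...`, `cbn_mlp...`) is hypothetical, and treating "Ft Stage 4" as covering every parameter in stage 4, including its BN parameters, is an assumption.

```python
def trainable_parameters(names, setting):
    """Return the visual-pipeline parameter names updated in a given setting."""
    def is_bn(name):
        return name.endswith(".gamma") or name.endswith(".beta")

    if setting == "MODERN":
        # All ResNet parameters (including gamma and beta) stay frozen;
        # only the MLPs that predict the deltas are trained.
        return [n for n in names if n.startswith("cbn_mlp.")]
    if setting == "Ft Stage 4":
        # Assumption: every parameter in stage 4 is fine-tuned.
        return [n for n in names if n.startswith("resnet.stage4.")]
    if setting == "Ft BN":
        # All BN scales and shifts are fine-tuned; convolution weights stay frozen.
        return [n for n in names if n.startswith("resnet.") and is_bn(n)]
    raise ValueError(f"unknown setting: {setting}")

# Toy parameter list with hypothetical names.
names = [
    "resnet.stage1.conv.weight", "resnet.stage1.bn.gamma", "resnet.stage1.bn.beta",
    "resnet.stage4.conv.weight", "resnet.stage4.bn.gamma", "resnet.stage4.bn.beta",
    "cbn_mlp.W1", "cbn_mlp.W2",
]
modern = trainable_parameters(names, "MODERN")
ft_stage4 = trainable_parameters(names, "Ft Stage 4")
ft_bn = trainable_parameters(names, "Ft BN")
```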

VQA

What color are her eyes? Brown
What is the mustache made of? Banana
Is he a boy? Yes

GuessWhat?!

Is it a vase? Yes
Is it on the left corner? No
Is it the turquoise and purple one? Yes

VQA Results

Although MODERN can be combined with any existing VQA architecture, in this work we plug it into an original VQA architecture with either a classic spatial attention mechanism or a 2-glimpse attention mechanism [2]. Models are trained on the training set with early stopping on the validation set, and accuracies are reported on the test-dev set.

Answer type                 Yes/No   Number   Other    Overall

224x224
Baseline                    79.45%   36.63%   44.62%   58.05%
Ft Stage 4                  78.37%   34.27%   43.72%   56.91%
Ft BN                       80.18%   35.98%   46.07%   58.98%
MODERN                      81.17%   37.79%   48.66%   60.82%

448x448
MRN [2] with ResNet-50      80.20%   37.73%   49.53%   60.84%
MRN [2] with ResNet-152     80.95%   38.39%   50.59%   61.73%
MCB [1] with ResNet-50      60.46%   38.29%   48.68%   60.46%
MCB [1] with ResNet-152     -        -        -        62.50%
MODERN                      81.38%   36.06%   51.64%   62.16%
MODERN + MRN [2]            82.17%   38.06%   52.29%   63.01%

CBN applied to    Val. accuracy
∅                 56.12%
Stage 4           57.68%
Stages 3-4        58.29%
Stages 2-4        58.32%
All               58.56%
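To read the ablation above at a glance, the gain of each configuration over the model without CBN can be computed directly from the reported validation accuracies. The snippet is plain arithmetic on the numbers in the table; the variable names are illustrative.

```python
# Validation accuracies (%) copied from the ablation table above.
val_accuracy = {
    "none": 56.12,
    "Stage 4": 57.68,
    "Stages 3-4": 58.29,
    "Stages 2-4": 58.32,
    "All": 58.56,
}

baseline = val_accuracy["none"]
# Absolute gain, in accuracy points, over the model without CBN.
gains = {
    name: round(acc - baseline, 2)
    for name, acc in val_accuracy.items()
    if name != "none"
}

for name, gain in gains.items():
    print(f"{name}: +{gain:.2f} points")
```

Stage 4 alone accounts for 1.56 of the 2.44 points gained when CBN is applied to all stages; modulating earlier stages adds the remainder.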

GuessWhat?! Results

We use the oracle model as defined in the original GuessWhat?! paper, with the (modulated) cropped object features, the object category, its spatial location and the question embedding as input. Models are trained on the training set with early stopping on the validation set, and errors are reported on the test set.

                        Raw ResNet   Ft Stage 4   Ft BN    CBN
Crop                    29.92%       27.48%       27.94%   25.06%
Crop + Spatial + Cat.   22.55%       22.68%       22.42%   19.52%

Spatial + Category: 21.5%

CBN applied to    Test error
∅                 29.92%
Stage 4           26.42%
Stages 3-4        25.24%
Stages 2-4        25.31%
All               25.06%

Modulated ResNet features disentangle answer types

For VQA, we show a t-SNE plot of 1000 raw (Left) and modulated (Right) ResNet-50 features.

References

[1] A. Fukui et al. Multimodal compact bilinear pooling for visual question answering and visual grounding. In Proc. of EMNLP, 2016.
[2] J. Kim et al. Hadamard product for low-rank bilinear pooling. In Proc. of ICLR, 2017.
