
arXiv:1606.08390v1 [cs.CV] 27 Jun 2016

Revisiting Visual Question Answering Baselines

Allan Jabri, Armand Joulin, and Laurens van der Maaten

Facebook AI Research

Abstract. Visual question answering (VQA) is an interesting learning setting for evaluating the abilities and shortcomings of current systems for image understanding. Many of the recently proposed VQA systems include attention or memory mechanisms designed to perform “reasoning”. Furthermore, for the task of multiple-choice VQA, nearly all of these systems train a multi-class classifier on image and question features to predict the answers. This paper questions the value of these common practices and develops a simple alternative model based on binary classification. Instead of treating answers as competing choices, our model receives the answer as input and predicts whether or not an image-question-answer triplet is correct. We evaluate our model on the Visual7W Telling and the VQA Real Multiple Choice tasks, and find that even simple versions of our model perform competitively. Our best model achieves state-of-the-art performance of 65.8% on the Visual7W Telling task and competes surprisingly well with the most complex systems proposed for the VQA Real Multiple Choice task. Additionally, we explore variants of our model and study the transferability of our model between both datasets. We also present an error analysis of our best model, the results of which suggest that a key problem of current VQA systems lies in the lack of visual grounding and localization of concepts that occur in the questions and answers.

Keywords: Visual question answering, dataset bias.

1 Introduction

Recent advances in computer vision have brought us close to the point where traditional object-recognition benchmarks such as Imagenet are considered to be “solved” [1,2]. These advances, however, also prompt the question of how we can move from object recognition to visual understanding; that is, how we can extend today’s recognition systems that provide us with “words” describing an image or an image region to systems that can produce a deeper semantic representation of the image content. Because benchmarks have traditionally been a key driver for progress in computer vision, several recent studies have proposed methodologies to assess our ability to develop such representations. These proposals include modeling relations between objects [3], visual Turing tests [4], and visual question answering [5,6,7,8].

The task of Visual Question Answering (VQA) is to answer questions, posed in natural language, about an image by providing an answer in the form of a short text.


Fig. 1. Four images with associated questions and answers from the Visual7W dataset. Correct answers are typeset in green. (The questions shown are “What color is the jacket?” with choices Red and blue / Yellow / Black / Orange; “How many cars are parked?” with choices Four / Three / Five / Six; “What event is this?” with choices A wedding / Graduation / A funeral / A picnic; and “When is this scene taking place?” with choices Day time / Night time / Evening / Morning.)

This answer can either be selected from multiple pre-specified choices or be generated by the system. As illustrated by the examples from the Visual7W dataset [8] in Figure 1, VQA naturally combines computer vision with natural language processing and reasoning, which makes it a good way to study progress on the path from computer vision to more general artificially intelligent systems.

VQA seems to be a natural playground to develop approaches able to perform basic “reasoning” about an image. Recently, many studies have explored this direction by adding simple memory or attention-based components to VQA systems. While in theory, these approaches have the potential to perform simple reasoning, it is not clear if they do actually reason, or if they do so in a human-comprehensible way. For example, Das et al. [10] recently reported that “machine-generated attention maps are either negatively correlated with human attention or have positive correlation worse than task-independent saliency”. In this work, we also question the significance of the performance obtained by current “reasoning”-based systems. In particular, this study sets out to answer a simple question: are these systems better than baselines designed to solely capture the dataset bias of standard VQA datasets? We limit the scope of our study to multiple-choice questions, as this allows us to perform a more controlled study that is not hampered by the tricky nuances of evaluating generated text [11,12].

We perform experimental evaluations on the Visual7W dataset [8] and the VQA dataset [5] to evaluate the quality of our baseline models. We: (1) study and model the bias in the Visual7W Telling and VQA Multiple Choice datasets, (2) measure the effect of using visual features from different CNN architectures, (3) explore the use of an LSTM as the system’s language model, and (4) study the transferability of our model between datasets.

Our best baseline model outperforms the current state-of-the-art on the Visual7W Telling task with a performance of 65.8%, and competes surprisingly well with the most complex systems proposed for the VQA dataset. Furthermore, our models perform competitively even with missing information (that is, missing images, missing questions, or both). Taken together, our results suggest that the performance of current VQA systems is not significantly better than that of systems designed to exploit dataset biases.


2 Related work

The recent surge of studies on visual question answering has been fueled by the release of several visual question-answering datasets, most prominently, the VQA dataset [5], the Visual Madlibs Q&A dataset [7], the Toronto COCO-QA dataset [6], and the Visual7W dataset [8]. Most of these datasets were developed by annotating subsets of the COCO dataset [13]. Geman et al. [4] proposed a visual Turing test in which the questions are automatically generated and require no natural language processing. Current approaches to visual question answering can be subdivided into “generation” and “classification” models:

Generation models. Malinowski et al. [14] train an LSTM model to generate the answer after receiving the image features (obtained from a convolutional network) and the question as input. Wu et al. [15] extend an LSTM generation model to use external knowledge that is obtained from DBpedia [16]. Gao et al. [17] study a similar model but decouple the LSTMs used for encoding and decoding. Whilst generation models are appealing because they can generate arbitrary answers (including answers that were not observed during training), in practice it is very difficult to jointly learn the encoding and decoding models from question-answering datasets of limited size. In addition, the evaluation of the quality of the generated text is complicated in practice [11,12].

Classification models. Zhou et al. [9] study an architecture in which image features are produced by a convolutional network, question features are produced by averaging word embeddings over all words in the question, and a multi-class logistic regressor is trained on the concatenated features; the unique answers are treated as outputs of the classification model. Zhu et al. [8] present a similar approach, but they use an attentional LSTM model to represent the question instead of an average over word embeddings. Similar approaches are also studied by Antol et al. [5] and Ren et al. [6]. Ma et al. [18] replace the LSTM encoder by a one-dimensional convolutional network that combines the word embeddings into a question embedding. Andreas et al. [19] use a similar model but perform the image processing using a compositional network whose structure is dynamically determined based on a parse of the question. Fukui et al. [20] propose the use of “bilinear pooling” for combining multi-modal information. Lu et al. [21] jointly learn a hierarchical attention mechanism based on parses of the question and the image, which they call “question-image co-attention”.

Our study is most closely related to a recent study by Shih et al. [22], which also considers models that treat the answer as an input variable and predict whether or not an image-question-answer triplet is correct. However, that study uses a substantially more complex pipeline than this work: their pipeline involves image-region selection.


Fig. 2. Overview of our system for visual question answering. See text for details.

3 System Overview

Figure 2 provides an overview of the architecture of our visual question answering system. The system takes an image-question-answer feature triplet as input. Unless otherwise stated (that is, in the LSTM experiment of Section 4), both the questions and the answers are represented by averaging word2vec embeddings over all words in the question or answer, respectively. The images are represented using features computed by a pre-trained convolutional network. Unless otherwise stated, we use the penultimate layer of Resnet-101 [2]. The word2vec embeddings are 300-dimensional and the image features are 2,048-dimensional. The three feature sets are concatenated and used to train a classification model that predicts whether or not the image-question-answer triplet is correct.
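As a concrete illustration of this feature pipeline, the sketch below assembles a triplet representation under stated assumptions: the `word2vec` lookup table and the `image_feature` argument stand in for the pre-trained 300-dimensional embeddings and the 2,048-dimensional penultimate-layer ResNet-101 activations described above, and the helper names are ours, not the authors'.

```python
import numpy as np

# Placeholder lookup table: word -> 300-d pre-trained word2vec vector
# (e.g. loaded from publicly released embeddings; left empty here).
word2vec = {}

def bow_embedding(text, dim=300):
    """Average word2vec embeddings over all words (bag-of-words text feature)."""
    vectors = [word2vec[w] for w in text.lower().split() if w in word2vec]
    if not vectors:
        return np.zeros(dim, dtype=np.float32)
    return np.mean(vectors, axis=0)

def triplet_features(image_feature, question, answer):
    """Concatenate image (2,048-d), question (300-d), and answer (300-d) features."""
    x_q = bow_embedding(question)
    x_a = bow_embedding(answer)
    return np.concatenate([image_feature, x_q, x_a])  # 2,648-d classifier input
```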

The classification models we consider are logistic regressors and multilayer perceptrons (MLP) trained on the concatenated features, and bilinear models that are trained on the answer features and a concatenation of the image and question features. The MLP has 8,192 hidden units unless otherwise specified. We use dropout [23] after the first layer. We denote the image, question, and answer features by x_i, x_q, and x_a, respectively. Denoting the sigmoid function σ(x) = 1/(1 + exp(−x)) and the concatenation operator x_iq = x_i ⊕ x_q, our models are given by:

Linear:   y = σ(W x_iqa + b)
Bilinear: y = σ(x_iq⊤ W x_a + b)
MLP:      y = σ(W_2 max(0, W_1 x_iqa) + b).
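For concreteness, a minimal PyTorch sketch of the bilinear scoring function follows; the dimensions match the text, but the class name and initialization scale are illustrative assumptions rather than the authors' implementation.

```python
import torch
import torch.nn as nn

class BilinearScorer(nn.Module):
    """Bilinear model: y = sigma(x_iq^T W x_a + b)."""
    def __init__(self, iq_dim=2048 + 300, a_dim=300):
        super().__init__()
        self.W = nn.Parameter(torch.randn(iq_dim, a_dim) * 0.01)  # assumed init scale
        self.b = nn.Parameter(torch.zeros(1))

    def forward(self, x_iq, x_a):
        # (x_iq @ W) has shape (batch, a_dim); the row-wise dot product with x_a
        # computes x_iq^T W x_a for every example in the batch.
        return torch.sigmoid((x_iq @ self.W * x_a).sum(dim=-1) + self.b)
```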

The parameters of the classifier are learned by minimizing the binary logistic loss of predicting whether or not an image-question-answer triplet is correct using stochastic gradient descent. The parameters of the convolutional network were learned by pre-training on the Imagenet dataset, following [24]. We did not finetune the weights of the convolutional networks. We used pre-trained word2vec [25] embeddings, which we did not finetune on VQA data either.
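The MLP variant and its training objective can be sketched as follows. The layer sizes match the text (2,048-d image features, 300-d question and answer features, 8,192 hidden units), but the dropout rate, learning rate, and batching are illustrative assumptions, not the authors' exact configuration.

```python
import torch
import torch.nn as nn

class TripletMLP(nn.Module):
    """MLP scorer: y = sigma(W2 max(0, W1 x_iqa) + b)."""
    def __init__(self, image_dim=2048, text_dim=300, hidden_dim=8192):
        super().__init__()
        in_dim = image_dim + 2 * text_dim          # concatenated x_i, x_q, x_a
        self.hidden = nn.Linear(in_dim, hidden_dim)
        self.dropout = nn.Dropout(p=0.5)           # dropout after the first layer (rate assumed)
        self.output = nn.Linear(hidden_dim, 1)

    def forward(self, x_iqa):
        h = self.dropout(torch.relu(self.hidden(x_iqa)))
        return self.output(h).squeeze(-1)          # logit; sigmoid is applied inside the loss

# Binary logistic loss over correct/incorrect triplets, optimized with SGD.
model = TripletMLP()
criterion = nn.BCEWithLogitsLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)  # assumed learning rate

x_iqa = torch.randn(32, 2048 + 600)                # dummy batch of triplet features
labels = torch.randint(0, 2, (32,)).float()        # 1 = correct triplet, 0 = incorrect
loss = criterion(model(x_iqa), labels)
loss.backward()
optimizer.step()
```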


Method             What  Where  When  Who   Why   How   Overall

LSTM (Q + I) [14]  48.9  54.4   71.3  58.1  51.3  50.3  52.1
LSTM-Att [8]       51.5  57.0   75.0  59.5  55.5  49.8  55.6
MCB [20]           60.3  70.4   79.5  69.2  58.2  51.1  62.2

MLP (A)            47.1  57.8   73.6  63.3  57.6  36.3  50.7
MLP (A + Q)        54.9  60.0   76.0  65.9  63.8  40.2  56.1
MLP (A + I)        60.0  73.9   80.0  70.3  63.9  36.6  61.0
MLP (A + Q + I)    64.4  76.2   82.3  72.8  69.5  40.9  64.8

Table 1. Comparison of our models with the state-of-the-art for the Visual7W Telling task [8]. Human accuracy on the task is 96.0%. Higher values are better.

Method                            Yes/No  Number  Other  All

Two-Layer LSTM [5]                80.6    37.7    53.6   63.1
Region selection [22]             77.2    33.5    56.1   62.4
Question-Image Co-Attention [21]  80.0    39.5    59.9   66.1

MLP (A + Q + I)                   80.8    17.6    62.0   65.2

Table 2. Comparison of our models with the state-of-the-art single models for the VQA Real Multiple Choice task [5]. Results are reported on the test2015-standard split. Human accuracy on the task is 83.3%. Higher values are better.

4 Experiments

We perform experiments on the following two datasets:

Visual7W Telling [8]. The dataset includes 69,817 training questions, 28,020 validation questions, and 42,031 test questions. Each question has four answer choices. The negative choices are human-generated on a per-question basis. The performance is measured by the percentage of correctly answered questions.

VQA Real Multiple Choice [5]. The dataset includes 248,349 questions for training, 121,512 for validation, and 244,302 for testing. Each question has 18 answer choices. The negative choices are randomly sampled from a predefined set of answers. Performance is measured following the metric proposed by [5].
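At test time, the natural reading of this setup is that the binary classifier scores every candidate answer for a question and the highest-scoring triplet is selected. The sketch below illustrates this, with `score_triplet` standing in for a trained scoring function; the helper names are hypothetical.

```python
import numpy as np

def predict_choice(score_triplet, image_feature, question, choices):
    """Return the answer choice whose (image, question, answer) triplet scores highest."""
    scores = [score_triplet(image_feature, question, answer) for answer in choices]
    return choices[int(np.argmax(scores))]

# Example with a stand-in scorer; a trained model would replace this lambda.
dummy_scorer = lambda img, q, a: float(len(a))
prediction = predict_choice(dummy_scorer, np.zeros(2048),
                            "What color is the jacket?",
                            ["Red and blue.", "Yellow.", "Black.", "Orange."])
```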

4.1 Comparison with State-of-the-Art

We first compare the MLP variant of our baseline model with the state-of-the-art on both datasets. Table 1 shows the results of this comparison on Visual7W, using three variants of our baseline with different inputs: (1) answer and question (A+Q); (2) answer and image (A+I); and (3) all three inputs (A+Q+I). Our simple baseline achieves state-of-the-art performance when it has access to all the information. Interestingly, as shown by the results with the A+Q variant of our model, simply exploiting the most frequent question-answer pairs obtains competitive performance.


Surprisingly, even a variant of our model that is trained on just the answers already achieves a performance of 50.7%, simply by learning biases in the answer distribution.

In Table 2, we also compare our models with the published state-of-the-art on the VQA dataset. Despite its simplicity, our baseline achieves comparable performance with state-of-the-art models. We note that a very recent paper [20] obtained 70.1%, but that work used an ensemble of 7 models trained on additional data (the Visual Genome dataset [3]). Nonetheless, [20] performs only 5% better than our model whilst being substantially more complex.

4.2 Additional Experiments

In the following, we present the results of additional experiments to understand why our model performs relatively well, and when it fails. All evaluations are conducted on the Visual7W Telling dataset unless stated otherwise.

Dataset    Model     Softmax  Binary

Visual7W   Linear    41.6     42.7
           Bilinear  –        61.8
           MLP       50.2     64.8

VQA        MLP       61.1     64.9

Table 3. Accuracy of models using either a softmax or a binary loss. Results are presented for different models using answer, question and image. On VQA, we use the test2015-dev split. Higher values are better.

During the daytime.           On the bus stop bench.     On a tree branch.

During daytime.               Bus bench.                 On the tree branch.
Outside, during the daytime.  In front of the bus stop.  The tree branch.
Inside, during the daytime.   The bus stop.              Tree branch.
In the daytime.               At the bus stop.           A tree branch.
In the Daytime.               The sign on the bus stop.  Tree branches.

Table 4. The five most similar answers in the Visual7W dataset for three answers appearing in that dataset (in terms of cosine similarity between their feature vectors).

Does it help to consider the answer as an input? In Table 3, we present the results of experiments in which we compare the performance of our (binary) baseline model with variants of the model that predict softmax probabilities over a discrete set of the 5,000 most common answers, as is commonly done in most prior studies, for instance, [9].

The results in the table show a substantial advantage of representing answers as inputs instead of outputs for the Visual7W Telling task and the VQA Real Multiple Choice task. Taking the answer as an input allows the system to model the similarity between different answers. For example, the answers “two people” and “two persons” are modeled by disjoint parameters in a softmax model; instead, the binary model will generally assign similar scores to these answers because they have similar bag-of-words word2vec representations.


Model     AlexNet  GoogLeNet  ResNet-34  ResNet-50  ResNet-101
(dim.)    (4,096)  (1,792)    (512)      (2,048)    (2,048)

Linear    42.7     42.9       42.7       42.7       42.8
Bilinear  54.9     57.0       58.5       60.7       61.8
MLP       61.6     62.1       63.9       64.2       64.8

Table 5. Accuracy on the Visual7W Telling task using visual features produced by five different convolutional networks. Higher values are better.

To illustrate this, Table 4 shows examples of the similarities that can be captured by the binary model. For a given answer, the table shows the five most similar answers in the dataset based on cosine similarity between the feature vectors. The binary model can readily exploit these similarities, whereas a softmax model has to learn them from the (relatively small) Visual7W training set.
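Nearest-answer lists such as those in Table 4 can be produced directly from the answer features; a minimal sketch is given below, where `embed` stands in for the averaged word2vec representation of Section 3 (the function names are ours).

```python
import numpy as np

def cosine_similarity(u, v, eps=1e-8):
    """Cosine similarity between two feature vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + eps))

def nearest_answers(query, candidate_answers, embed, k=5):
    """Rank candidate answers by cosine similarity of their feature vectors."""
    q = embed(query)
    scored = sorted(((cosine_similarity(q, embed(a)), a)
                     for a in candidate_answers if a != query), reverse=True)
    return [answer for _, answer in scored[:k]]
```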

Interestingly, the gap between the binary and softmax models is smaller on the VQA dataset. This result may be explained by the way the incorrect-answer choices were produced in both datasets: the choices are human-generated for each question in the Visual7W dataset, whereas in the VQA dataset, the choices are randomly chosen from a predefined set.

What is the influence of convolutional network architectures? Nearly all prior work on VQA uses features extracted using a convolutional network that is pre-trained on Imagenet to represent the image in an image-question pair. Table 5 shows to what extent the quality of these features influences the VQA performance by comparing five different convolutional network architectures: AlexNet [26], GoogLeNet [1], and residual networks with three different depths [2]. While the performance on Imagenet is correlated with performance in visual question answering, the results show this correlation is quite weak: a reduction in the Imagenet top-5 error of 18% corresponds to an improvement of only 3% in question-answering performance. This result suggests that the performance on VQA tasks is limited either by the fact that some of the visual concepts in the questions do not appear in Imagenet, or by the fact that the convolutional networks are only trained to recognize object presence and not to predict higher-level information about the visual content of the images.

Model     BoW   LSTM

Bilinear  51.5  52.5
MLP       56.1  51.0

Table 6. Accuracy on the Visual7W Telling dataset of a bag-of-words (BoW) and an LSTM model. We did not use image features to isolate the difference between language models. Higher values are better.

Do recurrent networks improve over bag of words? Our baseline uses a simple bag-of-words (BoW) model to represent the questions and answers. Recurrent networks (in particular, LSTMs [27]) form a popular alternative to BoW models.


Model            Method    What  Where  When  Who   Why   How   Overall

MLP (A + Q)      Scratch   54.9  60.0   76.0  65.9  63.8  40.2  56.1
                 Transfer  44.7  38.9   32.9  49.6  45.0  27.3  41.1

MLP (A + I)      Scratch   60.0  73.9   80.0  70.3  63.9  36.6  61.0
                 Transfer  28.4  26.6   44.1  37.0  31.7  25.2  29.4

MLP (A + Q + I)  Scratch   64.4  76.2   82.3  72.8  69.5  40.9  64.8
                 Transfer  58.7  61.7   41.7  60.2  53.2  29.1  53.8
                 Finetune  66.4  76.3   81.6  73.1  68.7  41.7  65.8

Table 7. Accuracy on Visual7W of models (1) trained from scratch, (2) transferred from the VQA dataset, and (3) finetuned after transferring. Higher values are better.

We perform an experiment in which we replace our BoW representations by an LSTM model. The LSTM was trained on the Visual7W Telling training set, using a concatenation of one-hot encodings and pre-trained word2vec embeddings as input for each word in the question.

We experimented both with using the average over time of the hidden states as the feature representation for the text and with using only the last hidden state. We observed little difference between the two; here, we report the results using the last-state representation.
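A hedged PyTorch sketch of this LSTM text encoder is shown below; the vocabulary size and hidden dimension are illustrative assumptions, while the concatenated one-hot plus word2vec input and the last-hidden-state readout follow the description above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LSTMTextEncoder(nn.Module):
    """Encode a question/answer as the last hidden state of an LSTM whose inputs
    concatenate a one-hot word encoding with a pre-trained word2vec embedding."""
    def __init__(self, vocab_size=10000, w2v_dim=300, hidden_dim=512):  # sizes assumed
        super().__init__()
        self.vocab_size = vocab_size
        self.lstm = nn.LSTM(vocab_size + w2v_dim, hidden_dim, batch_first=True)

    def forward(self, word_ids, w2v_vectors):
        # word_ids: (batch, seq) integer indices; w2v_vectors: (batch, seq, 300)
        one_hot = F.one_hot(word_ids, self.vocab_size).float()
        inputs = torch.cat([one_hot, w2v_vectors], dim=-1)
        outputs, (h_n, _) = self.lstm(inputs)
        return h_n[-1]          # last hidden state as the text representation
```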

Table 6 presents the results of our experiment comparing BoW and LSTM representations. To study just the difference between the language models, we did not use image features as input in this experiment. The results show that, despite their greater representational power, LSTMs actually do not outperform BoW representations on the Visual7W dataset, presumably because the dataset is quite small and the LSTM overfits easily. This may also explain why attentional LSTM models [8] perform poorly on the Visual7W dataset.

Can we transfer knowledge from VQA to Visual7W? An advantage of our models is that they can readily be transferred between datasets: they do not suffer from out-of-vocabulary problems, nor do they require the set of answers to be known in advance, which facilitates transfer learning. Table 7 shows the results of a transfer-learning experiment in which we train our model on the VQA dataset and use it to answer questions in the Visual7W dataset. We used three different variants of our model, and experimented with three different input sets. The table presents three sets of results: (1) baseline results in which we trained on Visual7W from scratch, (2) transfer results in which we train on VQA but test on Visual7W, and (3) results in which we train on VQA, finetune on Visual7W, and then test on Visual7W.

The poor performance of the A+I transfer-learning experiment suggests that there is a substantial difference in the answer distribution between both datasets. Nonetheless, transferring our full model from VQA to Visual7W works surprisingly well: we achieve a 53.8% score, which is less than 2% worse than the previous state-of-the-art [8], despite the fact that we did not use any of the Visual7W training data.


If we finetune the transferred model on the Visual7W dataset, it actually outperforms a model trained from scratch on that same dataset, obtaining an accuracy of 65.8%. This additional performance boost likely stems from the model adjusting to the biases in the Visual7W dataset.

5 Error Analysis

To better understand the shortcomings and limitations of our models, we performed an error analysis of the best model we obtained in Section 4 on six types of questions, which are illustrated in Figures 3–5.

Fig. 3. Examples of good and bad predictions by our visual question answering model on color and shape questions. Correct answers are typeset in green; incorrect predictions by our model are typeset in red. See text for details. (The questions shown are “What is the color of the tree leaves?” with choices Green / Brown / Orange / Red; “What is the color of the train?” with choices Green / Yellow / Black / Red; “What shape is this sign?” with choices Octagon / Oval / Hexagon / Square; and “What shape is the clock?” with choices Cube / Circle / Oval / Rectangle.)

Colors and Shapes. Approximately 5,000 questions in the Visual7W test set are about colors and approximately 200 questions are about shapes. While colors and shapes are fairly simple visual features, our models only achieve around 55% accuracy on these types of questions. For reference, our (A+Q) baseline already achieves 50% accuracy. This means that our models primarily learn the bias in the dataset. For example, for shape, the model predicts either “circle”, “round”, or “octagon” when the question is about a “sign”. For color questions, even though the performances are similar, it appears that the image-based models are able to capture additional information. For example, Figure 3 shows that the model tends to predict the most salient color, but fails to capture colors coming from small objects, which constitute a substantial number of questions in the Visual7W dataset. This result highlights the limits of using global image features in visual question answering.

Counting. There are approximately 5,000 questions in the Visual7W test set that involve counting the number of objects in the image (“how many ...?”). On this type of question, our model achieves an accuracy of 36%. This accuracy is hardly better than the 35% achieved by the (A+Q) baseline. Again, this implies that our model does not really extract information from the image that can be used for counting. In particular, our model has a strong preference for answers such as “none”, “one”, or “two”.


Fig. 4. Examples of good and bad predictions by our visual question answering model on counting and spatial reasoning. Correct answers are typeset in green; incorrect predictions by our model are typeset in red. See text for details. (The questions shown are “How many clouds are in the sky?” with choices None / Three / Five / Seven; “How many giraffes sitting?” with choices Three / One / Two / Four; “What is behind the photographer?” with choices A bus / A dump truck / A duck / A plate of food; and “What color leaves are on the tree behind the elephant on the left of the photo?” with choices Red / Orange / Green / Brown.)

Spatial Reasoning. We refer to any question that refers to a relative position (“left”, “right”, “behind”, etc.) as a question about “spatial reasoning”. There are approximately 1,500 such questions in the Visual7W test set. On questions requiring spatial reasoning, our models achieve an accuracy of approximately 50%, whereas a purely text-based model achieves an accuracy of 40%. This suggests that our models do, indeed, extract some information from the images that can be used to make inferences about spatial relations.

Fig. 5. Examples of good and bad predictions by our visual question answering model on action and causality. Correct answers are typeset in green; incorrect predictions by our model are typeset in red. See text for details. (The questions shown are “What is the man doing?” with choices Surfing / Singing / Working / Playing; a second “What is the man doing?” with choices Golfing / Playing tennis / Walking / Biking; “Why is the ground white?” with choices Snow / Sand / Stones / Concrete; and “Why is his arm up?” with choices To serve the tennis ball / About to hit the ball / Reaching for the ball / Swinging his racket.)

Actions. We refer to any question that asks what an entity is “doing” as an “action” question. There are approximately 1,200 such questions in the Visual7W test set. Our models achieve an accuracy of roughly 75% on action questions. By contrast, a purely text-based model achieves an accuracy of around 65%. This result suggests that our model does learn to exploit image features in recognizing actions. This result is in line with results presented in earlier studies that show image features transfer well to simple action-recognition tasks [28,29].


Causality. “Why” questions test the model’s ability to capture a weak form of causality. There are around 2,600 such questions. Our model has an accuracy of 68% on such questions, but a simple text-based model already obtains 62%. This means that most “why” questions can be answered by looking at the text. This is unsurprising, as many of these questions refer to common sense that is encoded in the text. For example, in Figure 5, one hardly needs the image to correctly predict that the ground is “white” because of “snow” instead of “sand”.

6 Discussion and Future Work

This paper presented a simple alternative model for visual question answering, explored variants of this model, and experimented with transfer between VQA datasets. Our study produced much stronger baseline systems than those presented in prior studies. In particular, our results demonstrate that featurizing the answers and training a binary classifier to predict the correctness of an image-question-answer triplet leads to substantial performance improvements over the current state-of-the-art on the Visual7W Telling task: our best model obtains an accuracy of 64.8% when trained from scratch, and 65.8% when transferred from VQA and finetuned on Visual7W, which is a 10% improvement over the original LSTM-Att model of [8], which achieves 55.6%. On the VQA Real Multiple Choice task, our model outperforms models that use LSTMs and attention mechanisms, and is close to the state-of-the-art despite being very simple.

Our error analysis demonstrates that future work in visual question answering should focus on grounding the visual entities that are present in the images, as the “difficult” questions in the Visual7W dataset cannot be answered without such grounding. Whilst global image features certainly help in question answering, they do not provide sufficient grounding of the visual concepts because current attentional mechanisms are very crude. More precise grounding of visual entities, as well as reasoning about the relations between these entities, is likely to be essential in making further progress.

Furthermore, in order to accurately evaluate future models, we need to understand the biases in VQA datasets. Many of the complex methods in prior work perform worse than the simple model presented in this paper. Presumably, one of two things (or both) may explain these results: (1) it may be that, currently, the best-performing models are those that can exploit biases in VQA datasets the best, i.e., models that “cheat” the best; (2) it may be that current, early VQA models are unsuitable for the difficult task of visual question answering, as a result of which all of them hit roughly the same ceiling in the experiments. In some of our experiments, we have seen that a model that appears qualitatively better may perform worse quantitatively, because it captures dataset biases less well. To address such issues, it may be necessary to consider alternative evaluation criteria that are less sensitive to dataset bias.

Finally, the results of our transfer-learning experiments suggest that exploring the ability of VQA systems to generalize across datasets may be an interesting alternative way to evaluate such systems in future work.


References

1. Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., Rabinovich, A.: Going deeper with convolutions. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. (2015)

2. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. (2016)

3. Krishna, R., Zhu, Y., Groth, O., Johnson, J., Hata, K., Kravitz, J., Chen, S., Kalantidis, Y., Li, L.J., Shamma, D., Bernstein, M., Fei-Fei, L.: Visual genome: Connecting language and vision using crowdsourced dense image annotations. In: arXiv:1602.07332. (2016)

4. Geman, D., Geman, S., Hallonquist, N., Younes, L.: Visual Turing test for computer vision systems. Proceedings of the National Academy of Sciences 112(12) (2015) 3618–3623

5. Antol, S., Agrawal, A., Lu, J., Mitchell, M., Batra, D., Zitnick, C., Parikh, D.: VQA: Visual question answering. In: Proceedings of the International Conference on Computer Vision. (2015)

6. Ren, M., Kiros, R., Zemel, R.: Exploring models and data for image question answering. In: Advances in Neural Information Processing Systems. (2015)

7. Yu, L., Park, E., Berg, A., Berg, T.: Visual madlibs: Fill in the blank image generation and question answering. In: arXiv:1506.00278. (2015)

8. Zhu, Y., Groth, O., Bernstein, M., Fei-Fei, L.: Visual7w: Grounded question answering in images. In: arXiv:1511.03416. (2015)

9. Zhou, B., Tian, Y., Sukhbaatar, S., Szlam, A., Fergus, R.: Simple baseline for visual question answering. In: arXiv:1512.02167. (2015)

10. Das, A., Agrawal, H., Zitnick, C.L., Parikh, D., Batra, D.: Human attention in visual question answering: Do humans and deep networks look at the same regions? In: arXiv:1606.03556. (2016)

11. Koehn, P.: Statistical significance tests for machine translation evaluation. In: EMNLP. (2004) 388–395

12. Callison-Burch, C., Osborne, M., Koehn, P.: Re-evaluating the role of BLEU in machine translation research. In: EACL. Volume 6. (2006) 249–256

13. Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollar, P., Zitnick, C.: Microsoft COCO: Common objects in context. In: Proceedings of the European Conference on Computer Vision. (2014)

14. Malinowski, M., Rohrbach, M., Fritz, M.: Ask your neurons: A neural-based approach to answering questions about images. In: Proceedings of the International Conference on Computer Vision. (2015)

15. Wu, Q., Shen, C., van den Hengel, A., Wang, P., Dick, A.: Image captioning and visual question answering based on attributes and their related external knowledge. In: arXiv:1603.02814. (2016)

16. Auer, S., Bizer, C., Kobilarov, G., Lehmann, J., Cyganiak, R., Ives, Z.: DBpedia: A nucleus for a web of open data. Springer (2007)

17. Gao, H., Mao, J., Zhou, J., Huang, Z., Wang, L., Xu, W.: Are you talking to a machine? Dataset and methods for multilingual image question answering. In: Advances in Neural Information Processing Systems. (2015)

18. Ma, L., Lu, Z., Li, H.: Learning to answer questions from image using convolutional neural network. In: arXiv:1506.00333. (2015)


19. Andreas, J., Rohrbach, M., Darrell, T., Klein, D.: Deep compositional question answering with neural module networks. In: arXiv:1511.02799. (2015)

20. Fukui, A., Huk Park, D., Yang, D., Rohrbach, A., Darrell, T., Rohrbach, M.: Multimodal compact bilinear pooling for visual question answering and visual grounding. (2016)

21. Lu, J., Yang, J., Batra, D., Parikh, D.: Hierarchical question-image co-attention for visual question answering. (2016)

22. Shih, K.J., Singh, S., Hoiem, D.: Where to look: Focus regions for visual question answering. In: arXiv:1511.07394. (2016)

23. Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: A simple way to prevent neural networks from overfitting. J. Mach. Learn. Res. 15(1) (January 2014) 1929–1958

24. Gross, S., Wilber, M.: Training and investigating residual nets. (2016)

25. Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient estimation of word representations in vector space. In: arXiv:1301.3781. (2013)

26. Krizhevsky, A., Sutskever, I., Hinton, G.: Imagenet classification with deep convolutional neural networks. In: Advances in Neural Information Processing Systems. (2012)

27. Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8) (1997) 1735–1780

28. Joulin, A., van der Maaten, L., Jabri, A., Vasilache, N.: Learning visual features from large weakly supervised data. In: arXiv:1511.0225. (2015)

29. Razavian, A.S., Azizpour, H., Sullivan, J., Carlsson, S.: CNN features off-the-shelf: an astounding baseline for recognition. In: arXiv:1403.6382. (2014)