Learning Visually-Grounded Semantics from Contrastive Adversarial Samples

Haoyue Shi*1, Jiayuan Mao*2, Tete Xiao*1, Yuning Jiang3, and Jian Sun3
1: Peking University  2: Tsinghua University  3: Megvii, Inc.
{hyshi, jasonhsiao97}@pku.edu.cn, [email protected], {jyn, sunjian}@megvii.com

INTRODUCTION

Visual-Semantic Embeddings (VSE)
• Use parallel image-caption pairs to embed texts and images into a joint space.
• Several datasets have been created for this purpose.
• However, even MS-COCO [1] is small compared with the compositional semantic space of natural language.

VSE with Contrastive Adversarial Samples (this work)
• Show the limitations of existing datasets and frameworks through adversarial attacks.
• Close the gap with semantics-aware text augmentation.
• Evaluate visual grounding on multiple tasks.

A SIMPLE YET EFFECTIVE APPROACH

Add the contrastive* adversarial samples to the training set.
*: Use online hard example mining (OHEM) to find the "contrastive" ones.

VSE [2]:
$$\ell_{\mathrm{VSE}}(i, c) = \sum_{c'} \left[\alpha + s(i, c') - s(i, c)\right]_+ + \sum_{i'} \left[\alpha + s(i', c) - s(i, c)\right]_+$$

VSE++ [3]:
$$\ell_{\mathrm{VSE{+}{+}}}(i, c) = \max_{c' \neq c} \left[\alpha + s(i, c') - s(i, c)\right]_+ + \max_{i' \neq i} \left[\alpha + s(i', c) - s(i, c)\right]_+$$

VSE-C (ours):
$$\ell_{\mathrm{VSE\text{-}C}}(i, c) = \ell_{\mathrm{VSE{+}{+}}}(i, c) + \max_{c'' \in C'(c)} \left[\alpha + s(i, c'') - s(i, c)\right]_+$$

Here $i$ is an image, $c$ is a caption, $s(\cdot,\cdot)$ is the similarity score, $\alpha$ is the margin, and $C'(c)$ is the set of contrastive adversarial samples for $c$.

BEGIN WITH ADVERSARIAL ATTACKS

Original caption: "Three giraffes and a rhino graze from trees."
(nouns: giraffes, rhino, trees; numeral / indefinite article: three, a; relation: graze from)

Contrastive adversarial samples:
• Noun: "Three cows and a rhino graze from trees."
• Numeral / indefinite article: "Three giraffes and three rhinos graze from trees."
• Relation: "Three giraffes and a rhino graze on trees." / "Trees graze from three giraffes and a rhino."

Semantics-aware Text Augmentation (Adversarial Samples)
• Noun: use WordNet [4] to compare word similarity (e.g., synonyms, hypernyms).
• Numeral / indefinite article: singularize or pluralize the corresponding nouns when necessary.
• Relation: dependency-parsing-based subject and object detection.

RESULTS

Caption retrieval on MS-COCO, without and with adversarial captions added to the candidate pool:

MS-COCO Test
Model          R@1   R@10  Med r  Avg r
VSE [2]        47.7  87.8  2.0    5.8
VSE++ [3]      55.7  92.4  1.0    4.3
VSE-C (+n.)    50.7  90.7  1.0    5.2
VSE-C (+num.)  53.3  90.2  1.0    5.8
VSE-C (+rel.)  52.4  89.0  1.0    5.7
VSE-C (+all)   50.2  89.8  1.0    5.2

MS-COCO Test w/ Adversarial Samples
Model          R@1   R@10  Med r  Avg r
VSE [2]        28.0  71.6  4.0    11.7
VSE++ [3]      35.6  72.5  3.0    11.8
VSE-C (+n.)    40.3  80.2  2.0    9.2
VSE-C (+num.)  46.9  86.3  2.0    6.9
VSE-C (+rel.)  42.3  82.5  2.0    7.2
VSE-C (+all)   47.4  88.8  2.0    5.5

VSE-C trades a small drop on the clean test set for large gains when adversarial captions are added to the candidate pool.

GROUNDING TEST I: WORD-OBJECT CORRELATION

Task: given an image and a word, predict whether the word corresponds to an object in the image.

Example captions for one image:
• "A table with a huge glass vase and fake flowers come out of it."
• "A plant in a vase sits at the end of a table."
• "A vase with flowers in it with long stems sitting on a table with candles."
• "A large centerpiece that is sitting on the edge of a dining table."
• "Flowers in a clear vase sitting on a table."
Positive objects: table, plant, vase. Negative objects: screen, pickle, sandwich, toy, hill, coat, cat, etc.

Model: an image encoder (ResNet-152) produces the image embedding; each word gets a word embedding; an embedding-interaction module predicts Pr[positive | image, word].

Model                mAP
GloVe [5]            58.7
VSE [2]              61.7
VSE++ [3]            61.1
VSE-C (ours, +all)   62.2
VSE-C (ours, +n.)    62.8
VSE-C (ours, +rel.)  62.3
VSE-C (ours, +num.)  62.0

SALIENCY VISUALIZATION

Which part of the image or caption, in particular, makes them semantically different? We compute the Jacobian (textual saliency is normalized for visualization):

$$J = \nabla_i\, s(i, c') = \nabla_i \left[ W_i^\top f(i; \theta_i) \cdot W_c^\top g(c'; \theta_c) \right]$$

Example (original caption: "an elephant walking through the weeds in the forest"; adversarial caption: "an elephant walking against the weeds in the forest"):

Word    an     elephant  walking  against  the    weeds  in     the    forest
VSE++   0.039  0.176     0.101    0.087    0.051  0.248  0.060  0.057  0.181
VSE-C   0.030  0.108     0.125    0.258    0.108  0.176  0.077  0.027  0.090

VSE-C puts its highest textual saliency on the altered relation word "against", whereas VSE++ attends mostly to the nouns. (The poster figure also shows the corresponding image saliency map for VSE-C.)

PAPER & CODE

Paper: http://aclweb.org/anthology/C18-1315
Code: https://github.com/ExplorerFreda/VSE-C
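The saliency Jacobian above is simply the gradient of the bilinear matching score through the image pathway. A minimal numpy sketch, assuming a toy linear image encoder $f(i; \theta_i) = \theta_i i$ and a precomputed caption feature $g(c')$ (the actual model uses a ResNet-152 image encoder and a learned caption encoder, and backpropagates to the pixels):

```python
import numpy as np

def score(i, g_c, W_i, W_c, theta_i):
    """Matching score s(i, c') = (W_i^T f(i)) . (W_c^T g(c')),
    with the toy linear image encoder f(i) = theta_i @ i."""
    return (W_i.T @ (theta_i @ i)) @ (W_c.T @ g_c)

def image_saliency(i, g_c, W_i, W_c, theta_i):
    """J = grad_i s(i, c').  For the linear toy encoder this has the
    closed form theta_i^T W_i (W_c^T g(c'))."""
    return theta_i.T @ (W_i @ (W_c.T @ g_c))
```

For the real model the same quantity is obtained by backpropagating $s(i, c')$ to the input image; textual saliency is computed analogously on the caption side and normalized per sentence.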
ACKNOWLEDGEMENTS

This work was done when HS, JM, and TX were intern researchers at Megvii, Inc. HS, JM, and TX contributed equally to this paper.

GROUNDING TEST II: FILL-IN-THE-BLANK

Example: "A table with a huge glass _____ and fake flowers come out of it." → predicted word: "vase".

Model: a GRU encoder reads the word embeddings of the blanked caption; its output is fused with the image embedding by an MLP to predict the missing word.

                    Model         R@1   R@10
Noun filling        GloVe [5]     23.2  58.8
                    VSE++ [3]     25.0  61.7
                    VSE-C (ours)  27.3  62.9
Prep. filling       GloVe [5]     23.3  79.9
                    VSE++ [3]     34.9  84.9
                    VSE-C (ours)  35.2  85.2
All (noun + prep.)  GloVe [5]     23.3  66.6
                    VSE++ [3]     28.4  68.1
                    VSE-C (ours)  30.0  70.98

REFERENCES

[1] Lin et al. Microsoft COCO: Common Objects in Context. In ECCV, 2014.
[2] Kiros et al. Unifying Visual-Semantic Embeddings with Multimodal Neural Language Models. arXiv preprint arXiv:1411.2539, 2014.
[3] Faghri et al. VSE++: Improving Visual-Semantic Embeddings with Hard Negatives. In BMVC, 2018.
[4] George A. Miller. WordNet: A Lexical Database for English. Communications of the ACM, 1995.
[5] Pennington et al. GloVe: Global Vectors for Word Representation. In EMNLP, 2014.
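For concreteness, the VSE-C objective from the approach section can be sketched in a few lines of numpy (a hypothetical helper, not the released PyTorch code; cosine similarity is one common choice for $s(\cdot,\cdot)$): the VSE++ part mines the hardest in-batch negatives, and the extra term mines the hardest contrastive adversarial caption, OHEM-style.

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity between each row of a and each row of b."""
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return a @ b.T

def vse_c_loss(img, cap, neg_caps, neg_imgs, adv_caps, alpha=0.2):
    """VSE-C hinge loss for one (image, caption) pair.

    img:      (d,)   image embedding i
    cap:      (d,)   caption embedding c
    neg_caps: (k, d) other captions in the batch (c')
    neg_imgs: (k, d) other images in the batch (i')
    adv_caps: (m, d) contrastive adversarial captions C'(c)
    """
    s_pos = cosine_sim(img[None], cap[None])[0, 0]   # s(i, c)
    s_c = cosine_sim(img[None], neg_caps)[0]         # s(i, c')
    s_i = cosine_sim(neg_imgs, cap[None])[:, 0]      # s(i', c)
    s_adv = cosine_sim(img[None], adv_caps)[0]       # s(i, c'')

    # VSE++ part: hardest in-batch negative caption and image.
    l_vsepp = (np.maximum(alpha + s_c.max() - s_pos, 0.0)
               + np.maximum(alpha + s_i.max() - s_pos, 0.0))
    # VSE-C part: hardest contrastive adversarial caption (OHEM-style max).
    l_adv = np.maximum(alpha + s_adv.max() - s_pos, 0.0)
    return l_vsepp + l_adv
```

With a perfect positive pair and orthogonal negatives the loss is zero; an adversarial caption embedded too close to the image contributes a positive penalty.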