Universität Hamburg
MIN-Fakultät, Fachbereich Informatik
TAMS-Folien

Natural Language Visual Grounding with Keyword-Aware Attention Network

Jinpeng Mi
Universität Hamburg
Fakultät für Mathematik, Informatik und Naturwissenschaften
Fachbereich Informatik
Technische Aspekte Multimodaler Systeme

8. Januar 2019

Jinpeng Mi 1
Outline

1. Introduction
   - Natural Language Visual Grounding
   - Attention Mechanism
Introduction - Natural Language Visual Grounding
Natural Language Visual Grounding
- task: given a referring expression, localize the referred object or area in an image
- example referring expression: "a glass of water on the table"
- applications: visual understanding systems, dialogue systems, natural-language-based interaction with intelligent agents, e.g., robots
- main difficulties:
  - how to learn the correlation between a natural language referring expression and the visual domain (image regions)
  - how to locate the target object (the spatial relationships between objects)
visual grounding is reformulated as three sub-problems:
- which words to focus on in a referring expression
- where to look in an image
- which object to locate
public datasets
- RefCOCO: 19994 images, 142210 expressions
- RefCOCO+: 19992 images, 141564 expressions
- RefCOCOg: 25799 images, 95010 expressions (no test set)
Introduction - Attention Mechanism
Attention Mechanism
- inspired by how the human visual cortex employs visual attention to focus on informative regions in visual scenes
- first proposed for machine translation [1] and image captioning [2]
- types: hard attention and soft attention
[1] Bahdanau, D., Cho, K., Bengio, Y. Neural machine translation by jointly learning to align and translate. ICLR 2015.
[2] Xu, K., Ba, J., Kiros, R., ..., Bengio, Y. Show, attend and tell: Neural image caption generation with visual attention. ICML 2015.
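The soft/hard distinction can be illustrated with a toy numpy sketch (all features and the query are random placeholders, not the network's actual representations): soft attention takes a differentiable weighted average over all features, while hard attention stochastically commits to a single one.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(scores):
    """Numerically stable softmax over a 1-D score vector."""
    e = np.exp(scores - scores.max())
    return e / e.sum()

# Toy setup: 5 region features of dimension 4 and a query vector.
features = rng.normal(size=(5, 4))
query = rng.normal(size=4)

# Soft attention: a differentiable weighted average of all features.
weights = softmax(features @ query)     # weights sum to 1
soft_context = weights @ features       # shape (4,)

# Hard attention: stochastically select a single feature; in training this
# non-differentiable choice is typically handled with sampling-based
# estimators such as REINFORCE.
idx = rng.choice(len(weights), p=weights)
hard_context = features[idx]            # shape (4,)
```

Soft attention is the variant used in most grounding models, since it trains end-to-end with plain backpropagation.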
Architecture
- which words to focus on
- where to look in an image
- which object to locate

- referring expression filtering: filter out insignificant words (determiners, coordinating conjunctions, "to", interjections, modal words, linking verbs, etc.)
- examples:
  raw: young man with blond hair wearing a white shirt and dark tie in a ballroom
  filtered: young man with blond hair wearing white shirt dark tie in ballroom
  raw: a person standing behind a snowboarder with a blue jacket and black pants
  filtered: person standing behind snowboarder with blue jacket black pants
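The filtering step above can be sketched as a simple word-list filter; the word lists here only approximate the POS classes named on the slide (a real implementation would presumably use a POS tagger), so treat them as assumptions.

```python
# Hypothetical word lists approximating the POS classes on the slide:
# determiners, coordinating conjunctions, "to", interjections, modal
# words, and linking verbs.
INSIGNIFICANT = {
    "a", "an", "the",                                   # determiners
    "and", "or", "but",                                 # coord. conjunctions
    "to",                                               # "to"
    "oh", "hey", "wow",                                 # interjections
    "can", "could", "may", "might", "must", "shall",    # modal words
    "should", "will", "would",
    "is", "am", "are", "was", "were", "be", "been",     # linking verbs
    "being",
}

def filter_expression(expression):
    """Drop insignificant words while keeping the original word order."""
    return " ".join(w for w in expression.split()
                    if w.lower() not in INSIGNIFICANT)

raw = ("young man with blond hair wearing a white shirt "
       "and dark tie in a ballroom")
print(filter_expression(raw))
# -> young man with blond hair wearing white shirt dark tie in ballroom
```

Applied to both slide examples, this reproduces the filtered expressions shown above.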
where W_w, b_w and β_w are trainable parameters, r_t are the computed attention weights, and ⊙ denotes element-wise multiplication.
* Yang, Z., Yang, D., Dyer, C., et al. Hierarchical attention networks for document classification. NAACL-HLT 2016.
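A minimal numpy sketch of word-level soft attention in the style of the cited Yang et al. (2016) formulation, using the symbols named above (W_w, b_w, β_w, r_t). The exact equation of the slide's model is not reproduced here, and the random parameters and hidden states are placeholders.

```python
import numpy as np

rng = np.random.default_rng(1)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

T, d = 6, 8                        # T words, hidden size d
H = rng.normal(size=(T, d))        # word hidden states, e.g. from an LSTM

# Trainable parameters (random here for illustration): projection W_w, b_w
# and a learned context vector beta_w, as in Yang et al. (2016).
W_w = rng.normal(size=(d, d))
b_w = rng.normal(size=d)
beta_w = rng.normal(size=d)

u = np.tanh(H @ W_w + b_w)         # projected word representations
r = softmax(u @ beta_w)            # attention weights r_t over the T words

# Keyword-aware expression representation: attention-weighted sum, i.e.
# each hidden state scaled element-wise by its weight, then summed.
s = (r[:, None] * H).sum(axis=0)   # shape (d,)
```

Words that survive the filtering step and carry discriminative content should, after training, receive the larger weights r_t.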
where v′ denotes the projected feature map, f is a non-linear function, W_s and b_s are trainable parameters, M_atten is the generated attention map, and ⊙ denotes element-wise multiplication.
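The spatial ("where to look") side can be sketched analogously: score every location of the projected feature map v′ with trainable W_s and b_s through a non-linearity f, normalize into an attention map M_atten, and weight v′ element-wise. All shapes and the choice of tanh for f are illustrative assumptions, not the slide's exact model.

```python
import numpy as np

rng = np.random.default_rng(2)

def softmax2d(x):
    """Softmax over all spatial locations of a 2-D score map."""
    e = np.exp(x - x.max())
    return e / e.sum()

h, w, c = 7, 7, 16                      # spatial grid and channel count
v_proj = rng.normal(size=(h, w, c))     # v': projected visual feature map

# Trainable parameters (random for illustration): W_s and b_s score each
# spatial location; f is taken to be tanh here.
W_s = rng.normal(size=c)
b_s = rng.normal()

scores = np.tanh(v_proj @ W_s + b_s)    # f(W_s v' + b_s), shape (7, 7)
M_atten = softmax2d(scores)             # attention map over locations

# Attended visual feature: v' weighted element-wise by M_atten, then
# pooled over the spatial grid.
v_att = (M_atten[..., None] * v_proj).sum(axis=(0, 1))  # shape (16,)
```

High values in M_atten mark the image regions the grounding model attends to when matching the referring expression.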