From Strings to Things: Knowledge-enabled VQA Model that can Read and Reason
Ajeet Kumar Singh¹, Anand Mishra², Shashank Shekhar³, Anirban Chakraborty³
¹TCS Research, ²IIT Jodhpur, ³IISc Bangalore

Apr 03, 2020

Transcript
Page 1

From Strings to Things: Knowledge-enabled VQA Model that can Read and Reason

Ajeet Kumar Singh¹, Anand Mishra², Shashank Shekhar³, Anirban Chakraborty³

¹TCS Research, ²IIT Jodhpur, ³IISc Bangalore

Page 2

Problem

Page 3

Traditional VQA [Antol et al., ICCV’15; Zhang et al., ICLR’18]
Q: How many cars are there in this image?
A: 2

Page 4

ST-VQA, Text-VQA [Biten et al., ICCV’19; Singh et al., CVPR’19]
Q: Which restaurant brand is written on the red wall?
A: KFC

Page 5

Text + Knowledge-enabled VQA [This work]
Q: Can I get chicken dish here?
A: Yes

Answering requires external knowledge

Page 6

New problem, No dataset exists!

Page 7

text-KVQA: A novel dataset

Q: Is this a Chinese restaurant?
A: No

Q: Can I get medicine here?
A: Yes

Q: When was this movie released?
A: 1995

● 257K images, 1 million QA pairs
● Associated knowledge base
● First dataset: text recognition + knowledge graph + VQA

Page 8

Proposed Solution

Knowledge base

Question: Is this an American brand?

Page 9

Proposal Module

Word proposals [Gupta et al., CVPR’16]: Subway, Open

Scene proposals [Zhou et al., TPAMI’17]: Fast food restaurant, shop front
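The proposal step can be pictured as keeping the most confident outputs of a text recognizer and a scene classifier. The sketch below is a minimal illustration and not the authors' code: the `Proposal` type, the `top_k` helper, and every confidence value are hypothetical, and the recognizers themselves are replaced by hard-coded outputs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Proposal:
    """One candidate label with the recognizer's confidence (hypothetical type)."""
    label: str
    confidence: float


def top_k(proposals, k):
    """Keep the k most confident proposals, highest confidence first."""
    return sorted(proposals, key=lambda p: p.confidence, reverse=True)[:k]


# Made-up raw outputs standing in for a word recognizer and a scene classifier.
raw_words = [
    Proposal("Subwav", 0.30),
    Proposal("Subway", 0.92),
    Proposal("Open", 0.81),
]
raw_scenes = [
    Proposal("shop front", 0.22),
    Proposal("bakery", 0.05),
    Proposal("Fast food restaurant", 0.64),
]

word_proposals = top_k(raw_words, k=2)
scene_proposals = top_k(raw_scenes, k=2)

print("Word proposals:", [p.label for p in word_proposals])
print("Scene proposals:", [p.label for p in scene_proposals])
```

Keeping several proposals rather than a single best guess leaves room for a later stage to recover from a recognition error.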

Page 10

Fusion Module

Fusion produces a relevance score for each knowledge fact.
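The scoring formula on this slide did not survive in the transcript, so the following is not the authors' fusion rule. It is a minimal sketch that assumes a fact's relevance is the unweighted sum of two cue scores, each being the best confidence-weighted string match between a proposal and the fact's entities. The facts, the confidences, and the function names `similarity`, `proposal_score`, and `relevance` are all hypothetical.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Case-insensitive string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def proposal_score(fact, proposals) -> float:
    """Best confidence-weighted match between any proposal and the fact's entities."""
    subject, _relation, obj = fact
    return max(
        (conf * max(similarity(text, subject), similarity(text, obj))
         for text, conf in proposals),
        default=0.0,
    )


def relevance(fact, word_proposals, scene_proposals) -> float:
    """Assumed fusion rule: unweighted sum of the word and scene cue scores."""
    return proposal_score(fact, word_proposals) + proposal_score(fact, scene_proposals)


# Hypothetical knowledge facts as (subject, relation, object) triples.
facts = [
    ("Subway", "Is a", "restaurant"),
    ("Subway", "Brand of", "USA"),
    ("KFC", "Is a", "restaurant"),
    ("KFC", "Brand of", "USA"),
]
# Proposals as (text, confidence); the confidences are made up.
word_proposals = [("Subway", 0.9), ("Open", 0.8)]
scene_proposals = [("Fast food restaurant", 0.6), ("shop front", 0.3)]

ranked = sorted(
    facts,
    key=lambda f: relevance(f, word_proposals, scene_proposals),
    reverse=True,
)
for fact in ranked:
    print(f"{relevance(fact, word_proposals, scene_proposals):.2f}", fact)
```

With these inputs the facts about Subway rank above the facts about KFC, which is the behaviour a relevance score is meant to provide before graph reasoning.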

Page 11

[Figure: knowledge graph with nodes Subway, KFC, restaurant, USA, sandwich, 1965; edge labels: Is a, Brand of, produces, Founded in]

Page 15

Graph representation: Gated Graph Neural Network (GGNN) [Li et al., ICLR’15]
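For readers unfamiliar with GGNNs, the sketch below shows the GRU-style propagation step from Li et al. in NumPy, with a single edge type and random, untrained weights. It is an illustration of the general technique, not the formulation used in this work. The edge list is an assumption: the transcript preserves the node and edge labels of the figure but not which nodes each edge connects. The parameter names are likewise hypothetical.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def ggnn_propagate(h, adj, params, steps=3):
    """Run GRU-style message passing; adj[i, j] = 1 means an edge i -> j."""
    for _ in range(steps):
        # Aggregate messages along incoming (adj.T) and outgoing (adj) edges.
        a = adj.T @ h @ params["W_in"] + adj @ h @ params["W_out"]
        z = sigmoid(a @ params["Wz"] + h @ params["Uz"])   # update gate
        r = sigmoid(a @ params["Wr"] + h @ params["Ur"])   # reset gate
        h_tilde = np.tanh(a @ params["Wh"] + (r * h) @ params["Uh"])
        h = (1.0 - z) * h + z * h_tilde
    return h


nodes = ["Subway", "KFC", "restaurant", "USA", "sandwich", "1965"]
# Assumed edges (source index, target index), for illustration only.
edges = [(0, 2), (0, 3), (0, 4), (0, 5), (1, 2), (1, 3)]

n, d = len(nodes), 8
adj = np.zeros((n, n))
for src, dst in edges:
    adj[src, dst] = 1.0

rng = np.random.default_rng(0)
names = ["W_in", "W_out", "Wz", "Uz", "Wr", "Ur", "Wh", "Uh"]
params = {name: rng.normal(scale=0.1, size=(d, d)) for name in names}

h0 = rng.normal(size=(n, d))          # random initial node states
h = ggnn_propagate(h0, adj, params)   # node states after 3 propagation steps
print(h.shape)
```

In a trained model the final node states would feed an output layer that scores candidate answers; that readout is omitted here.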

Page 16

Candidate answer: Yes

Page 17

text-KVQA accuracy (%)

Traditional VQA methods are not successful

Page 18

A popular QA over KB method improves the performance

Page 19

Our GGNN-based full model (text + vision) further improves the performance

Page 20

Summary

Please visit us at poster number: 18

1. text-KVQA: first dataset for knowledge-enabled VQA by reading text in image

2. Novel GGNN formulation

Dataset available at https://textkvqa.github.io/

Acknowledgements: Anirban Chakraborty is supported by Tata Trusts Travel Grant.