Holistic Scene Understanding via Multiple Structured Hypotheses from Perception Modules

Gordon Christie (1)*, Ankit Laddha (2)*, Aishwarya Agrawal (1), Stanislaw Antol (1), Yash Goyal (1), Dhruv Batra (1)
(1) Virginia Tech  (2) CMU  * equal contribution

Overview

Motivation: Perception problems are hard.
Goal: Holistic Scene Understanding (inputs from multiple modules)
Challenges:
• Inaccurate models
• Search space explosion

Proposed Solution

[Figure: the joint search space is (all possible segmentations) x (all possible support estimations) x ... x (all possible sentence parsings).]

Approach
• Extract diverse hypotheses from multiple modules [1]
• Jointly reason about hypotheses
  o Develop “Mediator” model (factor graph)
  o Infer consistency

Methods:
• INDEP: 1-best solution for each module
• Ours-CASCADE: DivMBest for module 1 + 1-best for module 2
• Ours-MEDIATOR: DivMBest for module 1 and module 2
• oracle: best tuple always selected

Experiment 1: Captioned Scene Understanding

Module 1: Semantic Segmentation (SS)
Module 2: Prepositional Phrase Attachment Resolution (PPAR)
Datasets: ABSTRACT-50S / PASCAL-50S / NYUv2
Features: Module, Consistent Preposition and Presence

[Figure: Semantic Segmentation hypotheses (labels: Couch, Person, Dog, Cat) and Sentence Parsing hypotheses for the caption “A dog is standing next to a woman on a couch”, with the consistent pair marked.]

ABSTRACT-50S
Module     INDEP   Ours-MEDIATOR   oracle
PPAR       56.73   77.39           97.53

NYUv2
Module     INDEP   Ours-CASCADE   Ours-MEDIATOR   oracle
SS         46.13   46.05          46.37           51.30
PPAR       61.54   57.69          64.42           92.31
Average    53.84   51.87          55.40           71.81

PASCAL-50S
Module     INDEP   Ours-CASCADE   Ours-MEDIATOR   oracle
SS         31.14   32.68          34.12           38.87
PPAR       62.42   78.92          87.00           96.50
Average    46.78   55.80          60.56           67.68

[Bar charts: Ours-MEDIATOR vs. INDEP: +20.7 % (ABSTRACT-50S), +1.6 % (NYUv2), +13.8 % (PASCAL-50S).]

Experiment 2: Indoor Scene Understanding

Module 1: Semantic Segmentation (SS)
Module 2: 3D Support Estimation (SE)
Dataset: NYUv2

[Figure: Semantic Segmentation and 3D Support Estimation hypotheses (labels: Wall, Table, Sofa, Chair, Television, Other Structure, Other Prop), with the consistent pair marked.]

Experiment 2 Results
Module     INDEP   Joint   Ours-CASCADE   Ours-MEDIATOR   oracle
SS         64.24   62.00   64.22          64.24           70.24
SE         55.48   56.43   57.38          57.33           62.29
Average    59.86   59.22   60.80          60.79           66.27

[Bar chart: Ours-MEDIATOR vs. INDEP: +0.93 %.]

Experiment 3: Urban Scene Understanding

Module 1: 2D Semantic Segmentation
Module 2: 3D Semantic Segmentation
Dataset: CITY (stereo)

[Figure: 2D Semantic Segmentation and 3D Semantic Segmentation hypotheses (labels: Road, Building, Sky, Person, Car, Sidewalk, Curb, Mark, Sign, Tree/bush), with the consistent pair marked.]

Experiment 3 Results
Module     INDEP   Ours-CASCADE   Ours-MEDIATOR   oracle
2D SS      54.80   55.65          55.65           57.82
3D SS      32.07   57.16          57.98           61.15
Average    43.44   56.41          56.82           59.49

[Bar chart: Ours-MEDIATOR vs. INDEP: +13.4 %.]

[1] D. Batra et al. Diverse M-Best Solutions in Markov Random Fields. In ECCV, 2012.
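
The four selection strategies listed under Methods (INDEP, Ours-CASCADE, Ours-MEDIATOR, oracle) can be illustrated with a toy sketch for two modules. This is not the poster's implementation: the poster's Mediator is a factor graph, whereas the sketch below simply enumerates every hypothesis pair and adds a hand-made consistency term. The function names, the additive scoring rule, and all numbers are hypothetical and chosen only to show how the strategies differ in the set of tuples they search.

```python
from itertools import product


def select_tuple(scores1, scores2, consistency, m1=None, m2=None):
    """Return the pair (i, j) with the highest combined score.

    scores1[i], scores2[j]: per-module scores; index 0 is the 1-best hypothesis.
    consistency[i][j]: hypothetical agreement score between the two hypotheses.
    m1, m2: number of top hypotheses to consider per module (default: all).
    """
    m1 = len(scores1) if m1 is None else m1
    m2 = len(scores2) if m2 is None else m2
    return max(
        product(range(m1), range(m2)),
        key=lambda ij: scores1[ij[0]] + scores2[ij[1]] + consistency[ij[0]][ij[1]],
    )


def indep(scores1, scores2, consistency):
    # INDEP: take the 1-best solution of each module, ignoring consistency.
    return (0, 0)


def cascade(scores1, scores2, consistency):
    # CASCADE: several hypotheses for module 1, only the 1-best for module 2.
    return select_tuple(scores1, scores2, consistency, m2=1)


def mediator(scores1, scores2, consistency):
    # MEDIATOR (simplified): several hypotheses for both modules.
    return select_tuple(scores1, scores2, consistency)


def oracle(accuracy):
    # oracle: pick the tuple with the best ground-truth accuracy (upper bound).
    pairs = product(range(len(accuracy)), range(len(accuracy[0])))
    return max(pairs, key=lambda ij: accuracy[ij[0]][ij[1]])


# Toy inputs (made up): 3 hypotheses from module 1, 2 from module 2.
scores1 = [10, 9, 5]
scores2 = [10, 8]
consistency = [[0, 2], [3, 6], [3, 0]]
accuracy = [[50, 55], [52, 70], [40, 45]]

print("INDEP   ", indep(scores1, scores2, consistency))
print("CASCADE ", cascade(scores1, scores2, consistency))
print("MEDIATOR", mediator(scores1, scores2, consistency))
print("oracle  ", oracle(accuracy))
```

In this toy setting each strategy searches a strictly larger set of tuples than the one before it, so the combined score of the selected tuple cannot decrease from INDEP to CASCADE to MEDIATOR. Exhaustive enumeration is only practical here because the number of hypotheses per module is small.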