Vision, Language and Action: from Captioning to Embodied AI
Lorenzo Baraldi, Marcella Cornia, Massimiliano Corsini
University of Modena and Reggio Emilia, Italy

Vision, Language and Action

May 25, 2022



Page 1: Vision, Language and Action


University of Modena and Reggio Emilia, Italy

Lorenzo Baraldi, Marcella Cornia, Massimiliano Corsini

Vision, Language and Action: from Captioning to Embodied AI

Page 2: Vision, Language and Action


University of Modena and Reggio Emilia, Italy

Massimiliano Corsini

Part V

Embodied AI and VLN

Page 3: Vision, Language and Action

• Generally speaking, to embody means to give a physical body to something. In this context, it means giving an Artificial Intelligence algorithm “a body”, so that an agent can act in an environment to solve tasks.

• Tasks involved in embodied AI:

• Embodied Visual Recognition

• Embodied Question Answering

• Interactive Question Answering

• Visual Navigation

• Vision-and-Language Navigation

Embodied AI


Manolis Savva, Abhishek Kadian, Oleksandr Maksymets, Yili Zhao, Erik Wijmans, Bhavana Jain, Julian Straub, Jia Liu, Vladlen Koltun, Jitendra Malik, Devi Parikh, Dhruv Batra, “Habitat: A Platform for Embodied AI Research”, ArXiv: 1904.01201, 2019.

Page 4: Vision, Language and Action

Embodied AI

Figure adapted from Savva et al., “Habitat: A Platform for Embodied AI Research”, ArXiv: 1904.01201, 2019.

Page 5: Vision, Language and Action

• In the following we describe in depth Vision-and-Language Navigation (VLN), a very recent and active research trend.

• VLN has been defined as:

• Interpret a previously unseen natural language navigation command in light of images generated by a previously unseen real environment (Anderson et al. CVPR 2018)

• Follow a given instruction to navigate from a starting location to a goal location (Fried et al. NeurIPS 2018)

• Reach a target location by navigating unseen environments, with a natural language instruction as only clue (Landi et al. BMVC 2019)

Embodied AI – VLN


Page 6: Vision, Language and Action

• Dataset of spaces:

• ScanNet (Dai et al. 2017)

• Stanford 2D-3D-Semantics (Armeni et al. 2017)

• Matterport3D (Chang et al. 2017)

• Replica (Straub et al. 2019)

• Dataset for VLN:

• R2R – Room to Room (Anderson et al. 2018)

• Touchdown (Chen et al. 2019)

• Simulation Environments

• Matterport3D Simulator (Anderson et al. 2018)

• Gibson (Zamir et al. 2018)

• Habitat (Savva et al. 2019)

Key Aspects


Page 7: Vision, Language and Action

• ScanNet is an RGB-D video dataset containing reconstructed scenes with instance-level semantic segmentations.

• 2.5 million RGB-D frames acquired with a custom device (similar to a Kinect)

• 1,513 scenes reconstructed (volume fusion)

• Instance-level semantic segmentation (20 classes + one class for free space) through 3D CAD models alignment

ScanNet (Dai et al. 2017)


Angela Dai, Angel X. Chang, Manolis Savva, Maciej Halber, Thomas Funkhouser, Matthias Niessner, “ScanNet: Richly-annotated 3D Reconstructions of Indoor Scenes”, Proc. Computer Vision and Pattern Recognition (CVPR), 2017.

Page 8: Vision, Language and Action

• Data collected using a Matterport Camera for indoor acquisition

• > 6,000 m² of indoor environment

• 1,413 equirectangular RGB images with corresponding depth and surface normal plus instance-level semantic data

• Annotation is performed in 3D, then projected onto the images (13 object classes)

Stanford 2D-3D-S (Armeni et al. 2017)

Iro Armeni, Sasha Sax, Amir Roshan Zamir, Silvio Savarese, “Joint 2D-3D-Semantic Data for Indoor Scene Understanding”, arXiv:1702.01105, 2017.

Page 9: Vision, Language and Action

• Created using the Matterport Camera (again)

• 90 buildings, 10,800 panoramic views, 194,400 RGBD images

• Corresponding textured 3D models

• Instance segmentation is provided

Matterport3D (Chang et al. – 3DV2017)


A. Chang, A. Dai, T. Funkhouser, M. Halber, M. Niessner, M. Savva, S. Song, A. Zeng, Y. Zhang, ”Matterport3D: Learning from RGB-D Data in Indoor Environments”, International Conference on 3D Vision (3DV 2017), 2017.

Page 10: Vision, Language and Action

• Ultra high photo-realism (Replica Turing Test)

• Designed for many tasks: egocentric vision, semantic segmentation, geometric inference, development of embodied agents (VLN, VQA)

• Acquired using a custom device (RGBD rig with IR projector) plus manual refinement

• 18 scenes of real world environments

• Dense high-quality mesh, HDR textures

• Semantic annotation is performed in parallel with a 2D instance-level masking tool → transferred to the mesh using a voting scheme → the final result is a SEMANTIC FOREST (88 classes).

• A minimal SDK to render the dataset is provided.

Replica (Straub et al. 2019)


Straub et al. “The Replica Dataset: A Digital Replica of Indoor Spaces”, arXiv:1906.05797, 2019.

Page 11: Vision, Language and Action

Dataset comparison


Straub et al. “The Replica Dataset: A Digital Replica of Indoor Spaces”, arXiv:1906.05797, 2019.

Page 12: Vision, Language and Action

• Gibson is a perception and physics simulator for the development of embodied agents (developed at Stanford).

• The renderer works with data acquired with the Matterport Camera.

• Different types of robotic agents can be trained.

• Physics engine (based on PyBullet).

• It tries to close the perception gap between rendered and real-world images.

• No support for agent-agent interactions.



Fei Xia, Amir Roshan Zamir, Zhi-Yang He, Alexander Sax, Jitendra Malik, Silvio Savarese, “Gibson Env: Real-World Perception for Embodied Agents”, CVPR 2018, pp.9068-9079.

Page 13: Vision, Language and Action

Gibson – Rendering Engine


Fei Xia, Amir Roshan Zamir, Zhi-Yang He, Alexander Sax, Jitendra Malik, Silvio Savarese, “Gibson Env: Real-World Perception for Embodied Agents”, CVPR 2018, pp.9068-9079.

Page 14: Vision, Language and Action

Gibson – Filling the gap between rendering and real images


Fei Xia, Amir Roshan Zamir, Zhi-Yang He, Alexander Sax, Jitendra Malik, Silvio Savarese, “Gibson Env: Real-World Perception for Embodied Agents”, CVPR 2018, pp.9068-9079.

Page 15: Vision, Language and Action

• Habitat is an open source framework for embodied AI

• Two main components:

• Habitat-SIM → 3D simulator

• Habitat-API → high-level library for end-to-end


• Features:

• High-quality rendering

• Generic dataset support: Matterport3D, Gibson, Replica

• Agents can be configured and equipped with different sensors

• Human-as-agent → this allows investigating human-agent and human-human interactions



Manolis Savva, Abhishek Kadian, Oleksandr Maksymets, Yili Zhao, Erik Wijmans, Bhavana Jain, Julian Straub, Jia Liu, Vladlen Koltun, Jitendra Malik, Devi Parikh, Dhruv Batra, “Habitat: A Platform for Embodied AI Research”, ArXiv: 1904.01201, 2019.

Page 16: Vision, Language and Action

• Builds upon Matterport3D dataset of spaces (Chang et al. 3DV 2017)

• 90 different buildings

• ~7k navigation paths

• 3 different descriptions / path

• ~29 words / instruction on average

• 2 different validation splits

• Test server with public leaderboard

R2R – Room to Room Benchmark


P. Anderson and Q. Wu and D. Teney and J. Bruce and M. Johnson and N. Sunderhauf and I. Reid and S. Gould and A. van den Hengel, “Vision-and-Language Navigation: Interpreting visually-grounded navigation instructions in real environments”, in Proc. of CVPR2018.

Page 17: Vision, Language and Action

R2R – Room to Room Benchmark


P. Anderson and Q. Wu and D. Teney and J. Bruce and M. Johnson and N. Sunderhauf and I. Reid and S. Gould and A. van den Hengel, “Vision-and-Language Navigation: Interpreting visually-grounded navigation instructions in real environments”, in Proc. of CVPR2018.

Page 18: Vision, Language and Action

• Real-life urban environment (built on Google StreetView image data)

• Touchdown task: reach a goal position by following the given instructions (i.e. a navigation task), then resolve a spatial description by finding a hidden teddy bear in the observed image (the spatial description resolution (SDR) task).

• Referring expression vs SDR:

• A referring expression discriminates an object w.r.t. other objects.

• An SDR sentence describes a specific location rather than discriminating between objects.

• 9,326 examples of English instructions

• 27,575 spatial description resolution tasks

• Language more complex than R2R

• Qualified annotators.

Touchdown (Chen et al. – CVPR2019)


H. Chen, A. Suhr, D. Misra, N. Snavely, Y. Artzi, “Touchdown: Natural language navigation and spatial reasoning in visual street environments”, in Proc. of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR2019), 2019, pp. 12538-12547.

Page 19: Vision, Language and Action

Touchdown (Chen et al. – CVPR2019)


H. Chen, A. Suhr, D. Misra, N. Snavely, Y. Artzi, “Touchdown: Natural language navigation and spatial reasoning in visual street environments”, in Proc. of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR2019), 2019, pp. 12538-12547.

SDR task

Page 20: Vision, Language and Action

• The first work to introduce the Vision-and-Language Navigation (VLN) task.

• Both VQA and VLN can be seen as sequence-to-sequence transcoding problems. In VLN the sequences are typically much longer than in VQA, and the model outputs actions.

• Contribution:

• Matterport3D Simulator → a framework for visual reinforcement learning built on Matterport3D dataset.

• Room-to-Room (R2R) → the first benchmark for VLN.

• A sequence-to-sequence neural network to solve the problem.

Algorithms – Anderson et al. – CVPR2018


Peter Anderson, Qi Wu, Damien Teney, Jake Bruce, Mark Johnson, Niko Sünderhauf, Ian D. Reid, Stephen Gould, Anton van den Hengel, “Vision-and-Language Navigation: Interpreting visually-grounded navigation instructions in real environments”, CVPR 2018.

Page 21: Vision, Language and Action

• Navigation graph: $G = \langle V, E \rangle$

• At each step the agent selects the next viewpoint $v_{t+1} \in W_{t+1}$ from the set of reachable viewpoints $W_{t+1}$ and adjusts the camera heading (azimuth angle) and elevation.

• The action space is: left, right, up, down, forward and stop.

• Proposed baseline:

• LSTM-based sequence-to-sequence architecture with an attention mechanism (Bahdanau et al. 2015)

Algorithms – Anderson et al. – CVPR2018


Peter Anderson, Qi Wu, Damien Teney, Jake Bruce, Mark Johnson, Niko Sünderhauf, Ian D. Reid, Stephen Gould, Anton van den Hengel, “Vision-and-Language Navigation: Interpreting visually-grounded navigation instructions in real environments”, CVPR 2018.

Page 22: Vision, Language and Action

• Language instruction encoding: each word $x_i$ is presented sequentially to the LSTM encoder as an embedding vector:

$h_i = \mathrm{LSTM}_{enc}(x_i, h_{i-1})$

• Image and action embedding: image features are extracted using a ResNet-152 trained on ImageNet. An embedding is learned for each action. The encoded image and the previous action features are concatenated to form a vector $q_t$.

$h'_t = \mathrm{LSTM}_{dec}(q_t, h'_{t-1})$

• Finally, the attention mechanism is applied to compute an instruction context $c_t = f(\mathbf{h}, h'_t)$ before the prediction of the action $a_t$.

Algorithms – Anderson et al. – CVPR2018
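
The attention step above can be illustrated with a small sketch. This is a minimal NumPy example, assuming plain dot-product scoring for $f$; the alignment function used in the paper may be parameterised differently, and the names `attention_context`, `enc_states` and `dec_state` are illustrative only.

```python
import numpy as np

def attention_context(enc_states, dec_state):
    """Return (context, weights) for one decoder step.

    enc_states: array of shape (L, d), encoder states h_1..h_L
    dec_state:  array of shape (d,), current decoder state h'_t
    Assumption: dot-product scores stand in for the alignment function f.
    """
    scores = enc_states @ dec_state            # one score per instruction word
    scores = scores - scores.max()             # shift for numerical stability
    weights = np.exp(scores) / np.exp(scores).sum()  # softmax over words
    context = weights @ enc_states             # weighted sum of encoder states
    return context, weights

rng = np.random.default_rng(0)
h = rng.normal(size=(5, 8))      # 5 encoded words, hidden size 8 (toy values)
h_dec = rng.normal(size=8)       # decoder state at step t (toy values)
c_t, alpha = attention_context(h, h_dec)
```

The context `c_t` has the same size as a hidden state and would be combined with the decoder state before predicting the action $a_t$.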


Peter Anderson, Qi Wu, Damien Teney, Jake Bruce, Mark Johnson, Niko Sünderhauf, Ian D. Reid, Stephen Gould, Anton van den Hengel, “Vision-and-Language Navigation: Interpreting visually-grounded navigation instructions in real environments”, CVPR 2018.

Page 23: Vision, Language and Action

• NE (Navigation Error)

• Distance between the agent final position and the goal

• SR (Success Rate)

• Fraction of episodes terminated within 3 meters from the goal

• OSR (Oracle SR)

• SR that the agent would have achieved if it received an oracle stop signal

• SPL (SR weighted by Path Length)

• SR weighted by normalized inverse path length (penalizes overlong navigations)

Evaluation Metrics
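
The four metrics can be computed per episode as in the sketch below. It is a simplified illustration: it assumes straight-line distances between 2D positions, whereas R2R evaluations typically measure distances along the navigation graph, and the function names are hypothetical. Dataset-level scores are the averages of these per-episode values.

```python
import math

def path_length(path):
    """Total length of a path given as a list of (x, y) positions."""
    return sum(math.dist(a, b) for a, b in zip(path, path[1:]))

def episode_metrics(path, goal, shortest_len, threshold=3.0):
    """Per-episode NE, SR, OSR and SPL.

    Assumption: straight-line distances are used for simplicity.
    shortest_len is the length of the shortest path from start to goal.
    """
    ne = math.dist(path[-1], goal)                       # navigation error
    success = float(ne < threshold)                      # stopped near the goal
    oracle = float(min(math.dist(p, goal) for p in path) < threshold)
    # SPL: success weighted by shortest length over the length actually walked
    spl = success * shortest_len / max(path_length(path), shortest_len)
    return {"NE": ne, "SR": success, "OSR": oracle, "SPL": spl}

goal = (10.0, 0.0)
# Agent walks through the goal and stops 4 m past it: fails SR, passes OSR.
overshoot = episode_metrics([(0, 0), (5, 0), (10, 0), (14, 0)], goal, shortest_len=10.0)
# Agent reaches the goal through a detour: succeeds, but SPL is below 1.
detour = episode_metrics([(0, 0), (5, 5), (10, 0)], goal, shortest_len=10.0)
```

The `overshoot` episode shows why OSR can exceed SR, and the `detour` episode shows how SPL penalizes overlong navigations.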


Page 24: Vision, Language and Action

• Main idea: introduce a speaker module able to generate a description given a context.

• To synthesize new path-instruction pairs

• To enable pragmatic reasoning (Hemachandra et al. 2015)

• Hence, we have a Follower and a Speaker:

• Follower → maps instructions to sequences of actions

• Speaker → maps action sequences to instructions

• Both the Follower and the Speaker are based on a standard sequence-to-sequence model.

Algorithms – Fried et al. – NIPS 2018


D. Fried, R. Hu, V. Cirik, A. Rohrbach, J. Andreas, L.-P. Morency, T. Berg-Kirkpatrick, K. Saenko, D. Klein, T. Darrell, “Speaker-Follower Models for Vision-and-Language Navigation”, in NeurIPS, 2018.

S. Hemachandra, F. Duvallet, T. M. Howard, N. Roy, A. Stentz, and M. R. Walter. Learning models for following natural language directions in unknown environments. arXiv preprint arXiv:1503.05079, 2015.

Page 25: Vision, Language and Action

• Follower estimates $p_F(r \mid d)$, where $r$ is a route and $d$ a description

• Speaker estimates $p_S(d \mid r)$

• At training time the Speaker is used for data augmentation:

• M new paths $\hat{r}_k$ ($k = 1, \dots, M$) are sampled as in Anderson et al. 2018

• New path-description pairs $(\hat{r}_k, \hat{d}_k)$ are generated (set S), with $\hat{d}_k = \arg\max_d p_S(d \mid \hat{r}_k)$

• The Follower is first trained on $S \cup D$, then fine-tuned on the original dataset D.

Algorithms – Fried et al. – NIPS 2018


D. Fried, R. Hu, V. Cirik, A. Rohrbach, J. Andreas, L.-P. Morency, T. Berg-Kirkpatrick, K. Saenko, D. Klein, T. Darrell, “Speaker-Follower Models for Vision-and-Language Navigation”, in NeurIPS, 2018.

Page 26: Vision, Language and Action

• At test time the Speaker is used to select the best path among K candidate paths (pragmatic inference)

• $r = \arg\max_r p_S(d \mid r)$ → solving this exactly is not feasible

• Instead, the best of K candidate paths is chosen according to:

$\arg\max_r \; p_S(d \mid r)^{\lambda} \cdot p_F(r \mid d)^{1-\lambda}$

• Panoramic action space:

• The agent perceives a 360-degree panoramic image and only high-level decisions are taken.


• The SR on unseen environments is about 53.5%, which is 30% better than previous approaches → the Speaker works!

Algorithms – Fried et al. – NIPS 2018


D. Fried, R. Hu, V. Cirik, A. Rohrbach, J. Andreas, L.-P. Morency, T. Berg-Kirkpatrick, K. Saenko, D. Klein, T. Darrell, “Speaker-Follower Models for Vision-and-Language Navigation”, in NeurIPS, 2018.
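
The candidate rescoring step can be sketched as follows. This is a toy illustration, assuming the Speaker and Follower scores are combined with exponents λ and 1 − λ and that each candidate already comes with its two log-probabilities; `pragmatic_select` and the candidate values are hypothetical.

```python
import math

def pragmatic_select(candidates, lam=0.5):
    """Pick the route maximizing p_S(d|r)^lam * p_F(r|d)^(1 - lam).

    candidates: list of (route, log_p_speaker, log_p_follower) tuples.
    The product is evaluated in log space, where it becomes a weighted sum.
    """
    def score(candidate):
        _, log_ps, log_pf = candidate
        return lam * log_ps + (1.0 - lam) * log_pf
    return max(candidates, key=score)[0]

# Toy candidate routes with made-up probabilities (not from the paper).
candidates = [
    ("route_a", math.log(0.10), math.log(0.60)),
    ("route_b", math.log(0.40), math.log(0.30)),
    ("route_c", math.log(0.05), math.log(0.10)),
]
follower_only = pragmatic_select(candidates, lam=0.0)
speaker_heavy = pragmatic_select(candidates, lam=0.95)
```

With `lam=0.0` the choice is the Follower's own favourite (`route_a`); giving most of the weight to the Speaker switches the choice to `route_b`, the route whose description the Speaker finds most plausible.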

Page 27: Vision, Language and Action

• Follows ideas from “Speaker-Follower” but the proposed approach is based on Reinforcement Learning (RL) and Imitation Learning (IL) instead of supervised learning.

• Contribution:

• Reinforced Cross-Modal Matching (RCM) framework that employs an extrinsic and an intrinsic reward. The intrinsic reward enforces cycle-reconstruction consistency.

• Self-Supervised Imitation Learning (SIL) to explore unseen environments with self-supervision and improve the overall performance.

Algorithms – Wang et al. – CVPR 2019


Xin Wang, Qiuyuan Huang, Asli Çelikyilmaz, Jianfeng Gao, Dinghan Shen, Yuan-Fang Wang, William Yang Wang, Lei Zhang, ”Reinforced Cross-Modal Matching and Self-Supervised Imitation Learning for Vision-Language Navigation”, CVPR 2019.

Page 28: Vision, Language and Action

• Reinforced Cross-Modal Matching (RCM)


trajectory given the instruction X

A high value of p means that the predicted trajectory is aligned with the reconstructed instructions.


Algorithms – Wang et al. – CVPR 2019


Xin Wang, Qiuyuan Huang, Asli Çelikyilmaz, Jianfeng Gao, Dinghan Shen, Yuan-Fang Wang, William Yang Wang, Lei Zhang, ”Reinforced Cross-Modal Matching and Self-Supervised Imitation Learning for Vision-Language Navigation”, CVPR 2019.

Page 29: Vision, Language and Action

• The trade-off between exploration and exploitation is one of the fundamental challenges in RL.

• Oh et al. proposed to exploit past good experiences to improve exploration, with theoretical justification of the effectiveness of this approach.

• Following this idea, the authors proposed SIL (Self-Supervised Imitation Learning):

• The agent imitates its own past good decisions.

• This is applied to unseen environments with no ground-truth information.

• A set of random paths is generated, then the matching critic is used to select the best trajectory and store it in a replay buffer.

• The trajectories stored in the replay buffer can then be exploited.

Wang et al. 2019 – Self-Supervised Imitation Learning (SIL)
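
The selection step of SIL can be sketched as below. This is only a schematic illustration: `sample_path` and `critic_score` are hypothetical stand-ins for the navigator's sampling procedure and the matching critic, and the toy values are made up.

```python
def sil_collect(instruction, sample_path, critic_score, n_samples, buffer=None):
    """Sample paths for an instruction and keep the one the critic rates best.

    sample_path(instruction)        -> a candidate trajectory (hypothetical)
    critic_score(instruction, path) -> higher means better matched (hypothetical)
    No ground-truth path is used: the critic alone decides what to imitate.
    """
    buffer = [] if buffer is None else buffer
    paths = [sample_path(instruction) for _ in range(n_samples)]
    best = max(paths, key=lambda p: critic_score(instruction, p))
    buffer.append((instruction, best))   # stored for later imitation
    return buffer

# Toy stand-ins: three fixed "sampled" paths and a critic that prefers longer ones.
toy_paths = iter([["forward"], ["left", "forward", "stop"], ["stop", "stop"]])
replay_buffer = sil_collect(
    "walk to the kitchen",
    sample_path=lambda instr: next(toy_paths),
    critic_score=lambda instr, path: len(path),
    n_samples=3,
)
```

The stored pairs can then be replayed as if they were demonstrations, which is how the agent imitates its own past good decisions.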


Xin Wang, Qiuyuan Huang, Asli Çelikyilmaz, Jianfeng Gao, Dinghan Shen, Yuan-Fang Wang, William Yang Wang, Lei Zhang, ”Reinforced Cross-Modal Matching and Self-Supervised Imitation Learning for Vision-Language Navigation”, CVPR 2019.

Page 30: Vision, Language and Action

• Beam search is not used because it is not feasible in a real scenario (a very interesting consideration by the authors).

• RCM goes from about 28% to 35% on unseen environments (SPL metric).

• SIL improves RCM by 17.5% on SR and by 21% on SPL.

Wang et al. – CVPR 2019 – Results


Xin Wang, Qiuyuan Huang, Asli Çelikyilmaz, Jianfeng Gao, Dinghan Shen, Yuan-Fang Wang, William Yang Wang, Lei Zhang, ”Reinforced Cross-Modal Matching and Self-Supervised Imitation Learning for Vision-Language Navigation”, CVPR 2019.

Page 31: Vision, Language and Action

• Recently, a new categorization has been introduced by Landi et al. (Landi et al. BMVC 2019)

• Methods are subdivided between two categories:

• Methods that work in low-level action space

• Methods that work in high-level action space

• Main contribution:

• SOTA for low-level methods

• It uses dynamic filters to jointly decide the agent’s actions.

Algorithms – Landi et al. (BMVC 2019)


Federico Landi, Lorenzo Baraldi, Massimiliano Corsini, Rita Cucchiara, "Embodied Vision-and-Language Navigation with Dynamic Convolutional Filters.”, in British Machine Vision Conference (BMVC), 2019.

Page 32: Vision, Language and Action

Landi et al. (BMVC 2019) – Action spaces

Low-level action space

Simulates continuous control of the agent

Move forward, turn left/right, tilt up/down, stop

High-level action space

Path selection on a discrete graph

Action space is a list of adjacent nodes

This work!

Federico Landi, Lorenzo Baraldi, Massimiliano Corsini, Rita Cucchiara, "Embodied Vision-and-Language Navigation with Dynamic Convolutional Filters.”, in British Machine Vision Conference (BMVC), 2019.

Page 33: Vision, Language and Action

Landi et al. (BMVC 2019) – Dynamic Convolutional Filters


Federico Landi, Lorenzo Baraldi, Massimiliano Corsini, Rita Cucchiara, "Embodied Vision-and-Language Navigation with Dynamic Convolutional Filters.”, in British Machine Vision Conference (BMVC), 2019.

...or “let the sentence drive the


Li et al. CVPR 2017

Gavrilyuk et al. CVPR 2018


Actor and Action


Query: “Woman with ponytail running”

Query: “Small white fluffy puppy biting the cat”

Page 34: Vision, Language and Action

Landi et al. (BMVC 2019) – Dynamic Convolutional Filters



[Figure: dynamic filters, with an L2 Norm block per filter]

...or “let the sentence drive the


Federico Landi, Lorenzo Baraldi, Massimiliano Corsini, Rita Cucchiara, "Embodied Vision-and-Language Navigation with Dynamic Convolutional Filters.”, in British Machine Vision Conference (BMVC), 2019.

Page 35: Vision, Language and Action

Landi et al. 2019 – Dynamic Convolutional Filters


Federico Landi, Lorenzo Baraldi, Massimiliano Corsini, Rita Cucchiara, "Embodied Vision-and-Language Navigation with Dynamic Convolutional Filters.”, in British Machine Vision Conference (BMVC), 2019.

...or “let the sentence drive the


# output feature maps


# dynamic filters
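
The idea of letting the sentence generate the filters can be sketched in NumPy. This is a minimal sketch under stated assumptions: the filters are produced by a single linear layer with tanh followed by L2 normalization, and are applied as 1×1 convolutions; the exact layers in the paper may differ, and `dynamic_filters`, `apply_filters`, `W` and `b` are illustrative names.

```python
import numpy as np

def dynamic_filters(sentence_emb, W, b, n_filters):
    """Generate n_filters L2-normalized filters from a sentence embedding.

    Assumption: one linear layer + tanh; W has shape (n_filters * C, d).
    """
    raw = np.tanh(W @ sentence_emb + b)          # (n_filters * C,)
    filters = raw.reshape(n_filters, -1)         # (n_filters, C)
    norms = np.linalg.norm(filters, axis=1, keepdims=True)
    return filters / np.maximum(norms, 1e-12)    # L2 norm per filter

def apply_filters(feature_map, filters):
    """1x1 convolution: dot product of each filter with each spatial location.

    feature_map: (C, H, W) image features; returns (n_filters, H, W),
    i.e. one output feature map per dynamic filter.
    """
    return np.einsum("kc,chw->khw", filters, feature_map)

rng = np.random.default_rng(0)
C, d, n_filters = 16, 32, 4                      # toy sizes
sentence = rng.normal(size=d)                    # stand-in sentence embedding
W = rng.normal(size=(n_filters * C, d))
b = np.zeros(n_filters * C)
filters = dynamic_filters(sentence, W, b, n_filters)
responses = apply_filters(rng.normal(size=(C, 7, 7)), filters)
```

Because the filters depend on the instruction, the same image features yield different response maps for different sentences; the number of output feature maps equals the number of dynamic filters.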

Page 36: Vision, Language and Action

Landi et al. (BMVC 2019) – Architecture

Federico Landi, Lorenzo Baraldi, Massimiliano Corsini, Rita Cucchiara, "Embodied Vision-and-Language Navigation with Dynamic Convolutional Filters.”, in British Machine Vision Conference (BMVC), 2019.

Page 37: Vision, Language and Action

• State-of-the-art in low-level action spaces

• Ablation study

• All the components are important

• Dynamic filters play the fundamental role

Landi et al. (BMVC 2019) – Results


Federico Landi, Lorenzo Baraldi, Massimiliano Corsini, Rita Cucchiara, "Embodied Vision-and-Language Navigation with Dynamic Convolutional Filters.”, in British Machine Vision Conference (BMVC), 2019.

Page 38: Vision, Language and Action

Landi et al. (BMVC 2019) – Results


Walk up the stairs. Turn right at the top of the stairs and walk along the red ropes. Walk through the open doorway straight ahead along the red carpet. Walk through that hallway into the room with couches and a marble coffee table.

Federico Landi, Lorenzo Baraldi, Massimiliano Corsini, Rita Cucchiara, "Embodied Vision-and-Language Navigation with Dynamic Convolutional Filters.”, in British Machine Vision Conference (BMVC), 2019.

Page 39: Vision, Language and Action

Landi et al. (BMVC 2019) – Results


Turn around and go down the entranceway, heading toward the staircase. Turn to your left and walk past the staircase to the open door way. Stop near the front of the doorway to the


Federico Landi, Lorenzo Baraldi, Massimiliano Corsini, Rita Cucchiara, "Embodied Vision-and-Language Navigation with Dynamic Convolutional Filters.”, in British Machine Vision Conference (BMVC), 2019.

Page 40: Vision, Language and Action

• Simulation framework

• Flexibility

• Human interaction

• Evaluation needs further improvements:

• Few datasets (e.g. only one dataset for the Spatial Description Resolution task)

• Performance of low-level-action vs high-level-action agents

• Metrics (we will see in a moment...)

• Real-world applications

• Lack of case studies



Page 41: Vision, Language and Action



Vihan Jain, Gabriel Magalhaes, Alexander Ku, Ashish Vaswani, Eugene Ie, Jason Baldridge, “Stay on the Path: Instruction Fidelity in Vision-and-Language Navigation”, in ACL 2019.

Blue path coming from R4R

SPL for the red path: 1.0
SPL for the orange path: 0.17
CLS for the red path: 0.23
CLS for the orange path: 0.87

Page 42: Vision, Language and Action

• Common metrics for evaluating VLN performance focus on reaching the goal rather than on the step-by-step correspondence with the given instructions.

• We need instruction-oriented metrics instead of goal-oriented metrics.

• Recently proposed metrics for this purpose:

• Coverage Weighted by Length Score (CLS) (Jain et al. 2019).

• Metrics based on dynamic time warping (nDTW, SDTW) (Magalhães et al. 2019)

Problem with commonly used metrics


Page 43: Vision, Language and Action

• A new dataset has been created to evaluate the newly proposed metric: Room-for-Room (R4R)

• R4R is built by joining paths in R2R, together with their corresponding descriptions

• From goal-oriented to instruction-oriented metrics

• A list of desiderata is proposed:

• Path similarity

• Soft penalties

• Unique optimum

• Scale invariance

• Computational tractability

CLS (Jain et al. 2019)


Gabriel Magalhães, Vihan Jain, Alexander Ku, Eugene Ie, Jason Baldridge, “Effective and General Evaluation for Instruction Conditioned Navigation using Dynamic Time Warping”, ArXiv: 1907.05446 (2019).

Page 44: Vision, Language and Action

• CLS is defined as :

CLS (Jain et al. 2019)


Vihan Jain, Gabriel Magalhaes, Alexander Ku, Ashish Vaswani, Eugene Ie, Jason Baldridge, “Stay on the Path: Instruction Fidelity in Vision-and-Language Navigation”, in ACL 2019.

Path Coverage

Length Score
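
For reference, CLS is usually written as the product of the two terms named above. The notation below is an assumption that follows the cited paper as we read it: $P$ is the reference path, $R$ the agent's path, $d(r, R)$ the distance from node $r$ to the nearest node of $R$, $PL(\cdot)$ the path length and $d_{th}$ a distance threshold.

```latex
\[
\mathrm{PC}(P, R) = \frac{1}{|P|} \sum_{r \in P} \exp\!\left(-\frac{d(r, R)}{d_{th}}\right)
\]
\[
\mathrm{EPL}(P, R) = \mathrm{PC}(P, R) \cdot \mathrm{PL}(P), \qquad
\mathrm{LS}(P, R) = \frac{\mathrm{EPL}(P, R)}{\mathrm{EPL}(P, R) + \left|\mathrm{EPL}(P, R) - \mathrm{PL}(R)\right|}
\]
\[
\mathrm{CLS}(P, R) = \mathrm{PC}(P, R) \cdot \mathrm{LS}(P, R)
\]
```

Path coverage rewards staying close to every node of the reference path, while the length score penalizes agent paths that are shorter or longer than the coverage would justify.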

Page 45: Vision, Language and Action

• The proposed metrics are based on Dynamic Time Warping (DTW).

• DTW is a well-established method to measure the similarity between time series.

• The approach of DTW is to find an optimal warping to align the elements of the series such that the cumulative distance between the aligned elements is minimized.

• This problem can be solved using dynamic programming.

nDTW and SDTW (Magalhães et al. 2019)
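
The dynamic-programming solution mentioned above can be sketched as follows. This is the textbook DTW recurrence on two sequences of 2D points with straight-line distances, which is an assumption for illustration; a VLN evaluation would typically use distances on the navigation graph.

```python
import math

def dtw(reference, query, dist=math.dist):
    """Cumulative distance of the optimal warping between two point sequences."""
    n, m = len(reference), len(query)
    # cost[i][j]: best cumulative distance aligning the first i reference
    # points with the first j query points.
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = dist(reference[i - 1], query[j - 1])
            cost[i][j] = d + min(
                cost[i - 1][j],      # reference advances, query point is reused
                cost[i][j - 1],      # query advances, reference point is reused
                cost[i - 1][j - 1],  # both advance
            )
    return cost[n][m]

ref = [(0, 0), (1, 0), (2, 0)]
same = dtw(ref, ref)
skipped = dtw(ref, [(0, 0), (2, 0)])   # query skips the middle point
```

Identical paths give a DTW of zero; in the second case the middle reference point has to be matched to a query point one unit away, so the cumulative distance is 1.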


Gabriel Magalhães, Vihan Jain, Alexander Ku, Eugene Ie, Jason Baldridge, “Effective and General Evaluation for Instruction Conditioned Navigation using Dynamic Time Warping”, ArXiv: 1907.05446 (2019).

Page 46: Vision, Language and Action

• Normalized Dynamic Time Warping (nDTW) metric:

• Success Weighted by Normalized Dynamic Time Warping (SDTW) metric:

nDTW and SDTW (Magalhães et al. 2019)
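
For reference, the two metrics are usually written as below. The notation is an assumption that follows the cited paper as we read it: $R = r_{1..|R|}$ is the reference path, $Q = q_{1..|Q|}$ the agent's path and $d_{th}$ the success threshold.

```latex
\[
\mathrm{nDTW}(R, Q) = \exp\!\left(-\frac{\mathrm{DTW}(R, Q)}{|R| \cdot d_{th}}\right)
\]
\[
\mathrm{SDTW}(R, Q) = \mathrm{SR}(R, Q) \cdot \mathrm{nDTW}(R, Q),
\qquad
\mathrm{SR}(R, Q) = \mathbb{1}\!\left[\, d\!\left(q_{|Q|}, r_{|R|}\right) \le d_{th} \right]
\]
```

The normalization maps the unbounded DTW cost to a score in $(0, 1]$, and SDTW additionally sets the score to zero for episodes that do not end near the goal.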


Gabriel Magalhães, Vihan Jain, Alexander Ku, Eugene Ie, Jason Baldridge, “Effective and General Evaluation for Instruction Conditioned Navigation using Dynamic Time Warping”, ArXiv: 1907.05446 (2019).

Page 47: Vision, Language and Action

• VLN is a modern, complex task which involves visual recognition, 3D scene understanding, and language processing.

• The multi-modal information (oral/text instructions, images, depth) must be processed to produce a sequence of actions.

• Simulation environments and benchmarks are continuously under development (this is good news).

• The transition from simulation to real-world applications is an important open issue.



Page 48: Vision, Language and Action


University of Modena and Reggio Emilia, Italy

Thank you for your attention