Mind the gap: improving success rate of Vision-and-Language navigation by revisiting Oracle Success Routes

Chongyang Zhao, Yuankai Qi, Qi Wu*

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding › Conference proceeding contribution › peer-review

2 Citations (Scopus)

Abstract

Vision-and-Language Navigation (VLN) aims to navigate to a target location by following a given instruction. Unlike existing methods, which focus on predicting a more accurate action at each navigation step, this paper makes the first attempt to tackle a long-ignored problem in VLN: narrowing the gap between Success Rate (SR) and Oracle Success Rate (OSR). We observe a consistently large gap (up to 9%) for four state-of-the-art VLN methods across two benchmark datasets, R2R and REVERIE. A high OSR indicates that the robot agent passes the target location, while a low SR shows that the agent ultimately fails to stop there. Instead of predicting actions directly, we propose to mine the target location from a trajectory produced by off-the-shelf VLN models. Specifically, we design a multi-module transformer-based model that learns compact, discriminative trajectory viewpoint representations, which are used to predict the confidence of each viewpoint being the target location described in the instruction. The proposed method is evaluated on three widely adopted datasets (R2R, REVERIE and NDH) and shows promising results, demonstrating the potential for further research.
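The difference between SR and OSR, and the idea of re-selecting the stop point from an existing trajectory, can be illustrated with a minimal sketch. It is not the authors' implementation. The names `is_success`, `is_oracle_success` and `restop` are hypothetical, the 3 m success radius is the threshold commonly used for R2R and is assumed here, straight-line distance stands in for the shortest-path distance used by the benchmarks, and the hand-written confidence scores stand in for the output of the transformer-based model described in the abstract.

```python
from math import dist

SUCCESS_RADIUS_M = 3.0  # assumption: success threshold commonly used for R2R


def is_success(trajectory, target, radius=SUCCESS_RADIUS_M):
    """SR criterion: the final viewpoint lies within `radius` of the target."""
    return dist(trajectory[-1], target) <= radius


def is_oracle_success(trajectory, target, radius=SUCCESS_RADIUS_M):
    """OSR criterion: at least one visited viewpoint lies within `radius`."""
    return any(dist(point, target) <= radius for point in trajectory)


def rate(flags):
    """Fraction of episodes for which the criterion holds."""
    flags = list(flags)
    return sum(flags) / len(flags)


def restop(trajectory, confidences):
    """Cut the trajectory at the viewpoint with the highest confidence.

    `confidences` is a hypothetical per-viewpoint score; in the paper this
    role is played by a learned model, not by hand-written numbers.
    """
    best = max(range(len(trajectory)), key=lambda i: confidences[i])
    return trajectory[: best + 1]


# Toy episodes with 2D viewpoint coordinates in metres (illustrative only).
episodes = [
    # Passes the target, then overshoots: counts for OSR but not for SR.
    {"trajectory": [(0, 0), (2, 0), (4, 0), (8, 0)], "target": (4, 1)},
    # Stops at the target: counts for both.
    {"trajectory": [(0, 0), (0, 3), (0, 6)], "target": (0, 6)},
    # Never gets close: counts for neither.
    {"trajectory": [(0, 0), (5, 0)], "target": (0, 20)},
]

sr = rate(is_success(e["trajectory"], e["target"]) for e in episodes)
osr = rate(is_oracle_success(e["trajectory"], e["target"]) for e in episodes)

# Re-select the stop point of the first episode with made-up confidences.
fixed = restop(episodes[0]["trajectory"], [0.1, 0.3, 0.9, 0.2])
fixed_success = is_success(fixed, episodes[0]["target"])
```

In this toy data the first episode is the kind of case the abstract targets: the agent walks past the target, so the episode counts towards OSR but not SR until a better stop point is chosen from the same trajectory.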

Original language: English
Title of host publication: MM '23
Subtitle of host publication: Proceedings of the 31st ACM International Conference on Multimedia
Place of Publication: New York
Publisher: Association for Computing Machinery
Pages: 4349-4358
Number of pages: 10
ISBN (Electronic): 9798400701085
DOIs
Publication status: Published - 2023
Externally published: Yes
Event: 31st ACM International Conference on Multimedia, MM 2023 - Ottawa, Canada
Duration: 29 Oct 2023 to 3 Nov 2023

Conference

Conference: 31st ACM International Conference on Multimedia, MM 2023
Country/Territory: Canada
City: Ottawa
Period: 29/10/23 to 3/11/23

Keywords

  • Vision-and-Language Navigation
  • Multi-Modality Transformer
  • Visual Context Modelling
