Abstract
Vision-and-Language Navigation (VLN) aims to navigate an agent to a target location by following a given natural-language instruction. Unlike existing methods that focus on predicting a more accurate action at each navigation step, in this paper we make the first attempt to tackle a long-ignored problem in VLN: narrowing the gap between Success Rate (SR) and Oracle Success Rate (OSR). We observe a consistently large gap (up to 9%) for four state-of-the-art VLN methods across two benchmark datasets: R2R and REVERIE. The high OSR indicates that the agent passes the target location, while the low SR indicates that the agent ultimately fails to stop at the target location. Instead of predicting actions directly, we propose to mine the target location from a trajectory given by off-the-shelf VLN models. Specifically, we design a multi-module transformer-based model for learning a compact, discriminative representation of each trajectory viewpoint, which is used to predict the confidence of that viewpoint being the target location described in the instruction. The proposed method is evaluated on three widely adopted datasets: R2R, REVERIE and NDH, and shows promising results, demonstrating its potential for future research.
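To make the abstract's core idea concrete, the sketch below illustrates one way a transformer could re-score the viewpoints of a trajectory produced by an off-the-shelf VLN agent against the instruction and select the most likely stopping point. This is a minimal illustration under assumed feature dimensions and module names (e.g. `TrajectoryViewpointScorer`), not the paper's actual architecture.

```python
# Illustrative sketch only (not the paper's implementation): re-score the
# viewpoints of a trajectory produced by an off-the-shelf VLN agent and
# pick the one most likely to match the target described by the instruction.
# Feature sizes, layer counts, and module names below are assumptions.
import torch
import torch.nn as nn


class TrajectoryViewpointScorer(nn.Module):
    def __init__(self, vis_dim=2048, txt_dim=768, d_model=512, n_layers=2, n_heads=8):
        super().__init__()
        self.vis_proj = nn.Linear(vis_dim, d_model)   # project viewpoint image features
        self.txt_proj = nn.Linear(txt_dim, d_model)   # project instruction token features
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.score_head = nn.Linear(d_model, 1)       # confidence of being the target

    def forward(self, viewpoint_feats, instr_feats):
        # viewpoint_feats: (B, T, vis_dim) features of the T viewpoints on a trajectory
        # instr_feats:     (B, L, txt_dim) instruction token embeddings
        v = self.vis_proj(viewpoint_feats)
        t = self.txt_proj(instr_feats)
        # jointly encode instruction tokens and trajectory viewpoints
        fused = self.encoder(torch.cat([t, v], dim=1))
        v_out = fused[:, t.size(1):]                  # keep only the viewpoint positions
        return self.score_head(v_out).squeeze(-1)     # (B, T) per-viewpoint confidences


# Usage: stop at the viewpoint with the highest predicted confidence.
scorer = TrajectoryViewpointScorer()
scores = scorer(torch.randn(1, 12, 2048), torch.randn(1, 40, 768))
stop_index = scores.argmax(dim=-1)  # index of the predicted target viewpoint
```

In this framing the stopping decision becomes a post-hoc selection over an already-generated trajectory, which is what allows the gap between OSR (the agent passed the target) and SR (the agent stopped there) to be narrowed without retraining the underlying navigator.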
| Original language | English |
|---|---|
| Title of host publication | MM '23 |
| Subtitle of host publication | Proceedings of the 31st ACM International Conference on Multimedia |
| Place of Publication | New York |
| Publisher | Association for Computing Machinery |
| Pages | 4349-4358 |
| Number of pages | 10 |
| ISBN (Electronic) | 9798400701085 |
| DOIs | |
| Publication status | Published - 2023 |
| Externally published | Yes |
| Event | 31st ACM International Conference on Multimedia, MM 2023, Ottawa, Canada; 29 Oct 2023 → 3 Nov 2023 |
Conference
| Conference | 31st ACM International Conference on Multimedia, MM 2023 |
|---|---|
| Country/Territory | Canada |
| City | Ottawa |
| Period | 29/10/23 → 3/11/23 |
Keywords
- Vision-and-Language Navigation
- Multi-Modality Transformer
- Visual Context Modelling