HOP: history-and-order aware pretraining for Vision-and-Language Navigation

Yanyuan Qiao, Yuankai Qi, Yicong Hong, Zheng Yu, Peng Wang, Qi Wu*

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding; Conference proceeding contribution; peer-review

38 Citations (Scopus)

Abstract

Pretraining has been adopted in a few recent works for Vision-and-Language Navigation (VLN). However, previous pre-training methods for VLN either lack the ability to predict future actions or ignore the trajectory contexts, which are essential for a greedy navigation process. In this work, to promote the learning of spatio-temporal visual-textual correspondence as well as the agent's capability of decision making, we propose a novel history-and-order aware pre-training paradigm (HOP) with VLN-specific objectives that exploit the past observations and support future action prediction. Specifically, in addition to the commonly used Masked Language Modeling (MLM) and Trajectory-Instruction Matching (TIM), we design two proxy tasks to model temporal order information: Trajectory Order Modeling (TOM) and Group Order Modeling (GOM). Moreover, our navigation action prediction is also enhanced by introducing the task of Action Prediction with History (APH), which takes into account the historical visual perceptions. Extensive experimental results on four downstream VLN tasks (R2R, REVERIE, NDH, RxR) demonstrate the effectiveness of our proposed method compared against several state-of-the-art agents.

Original language: English
Title of host publication: 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition CVPR 2022
Subtitle of host publication: proceedings
Place of Publication: Piscataway, NJ
Publisher: Institute of Electrical and Electronics Engineers (IEEE)
Pages: 15397-15406
Number of pages: 10
ISBN (Electronic): 9781665469463
ISBN (Print): 9781665469470
Publication status: Published - 2022
Externally published: Yes
Event: 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022 - New Orleans, United States
Duration: 19 Jun 2022 → 24 Jun 2022

Publication series

ISSN (Print): 1063-6919
ISSN (Electronic): 2575-7075

Conference

Conference: 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022
Country/Territory: United States
City: New Orleans
Period: 19/06/22 → 24/06/22
