TY - JOUR
T1 - HOP+: history-enhanced and order-aware pre-training for vision-and-language navigation
AU - Qiao, Yanyuan
AU - Qi, Yuankai
AU - Hong, Yicong
AU - Yu, Zheng
AU - Wang, Peng
AU - Wu, Qi
PY - 2023/7
N2 - Recent works attempt to employ pre-training in Vision-and-Language Navigation (VLN). However, these methods neglect the importance of historical context or ignore the prediction of future actions during pre-training, which limits the learning of visual-textual correspondence and the capability of decision-making. To address these problems, we present a history-enhanced and order-aware pre-training paradigm with complementary fine-tuning (HOP+) for VLN. Specifically, besides the common Masked Language Modeling (MLM) and Trajectory-Instruction Matching (TIM) tasks, we design three novel VLN-specific proxy tasks: Action Prediction with History (APH), Trajectory Order Modeling (TOM), and Group Order Modeling (GOM). The APH task takes the visual perception trajectory into account to enhance the learning of historical knowledge as well as action prediction. The two temporal visual-textual alignment tasks, TOM and GOM, further improve the agent's ability to reason about order. Moreover, we design a memory network to address the inconsistency of history-context representation between the pre-training and fine-tuning stages. The memory network effectively selects and summarizes historical information for action prediction during fine-tuning, without incurring large extra computation cost on downstream VLN tasks. HOP+ achieves new state-of-the-art performance on four downstream VLN tasks (R2R, REVERIE, RxR, and NDH), which demonstrates the effectiveness of our proposed method.
UR - http://www.scopus.com/inward/record.url?scp=85147228971&partnerID=8YFLogxK
UR - http://purl.org/au-research/grants/arc/DE190100539
DO - 10.1109/TPAMI.2023.3234243
M3 - Article
C2 - 37018268
AN - SCOPUS:85147228971
SN - 0162-8828
VL - 45
SP - 8524
EP - 8537
JO - IEEE Transactions on Pattern Analysis and Machine Intelligence
IS - 7
ER -