Indoor scene recognition in vision-and-language navigation

Hongtao Zhang, Yuankai Qi, Mingbo Zhao*, Yuping Liu*

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

Abstract

Indoor scene recognition is a crucial component of vision-and-language navigation (VLN), which involves guiding an agent through unseen, photo-realistic environments using natural language instructions. However, existing research has not yet endowed VLN agents with robust indoor scene recognition capabilities, which are essential for enhancing an agent's spatial perception and environmental understanding. In this paper, we introduce the Room-to-Room Scene Recognition (R2R-SR) dataset, designed to strengthen the spatial perception capabilities of agents in VLN. The R2R-SR dataset comprises 7,283 images across 10 room classes, sourced from the Matterport3D (MP3D) simulator. To alleviate the imbalanced distribution of training data across room categories, we propose a panoramic environment generation technique using Llava and Stable Diffusion for data Augmentation (LSDA). Our proposed LSDA-based model achieves a Top-1 accuracy of 75.9%, outperforming the best competitors in both the convolutional neural network (CNN) and vision transformer (ViT) domains. Qualitative results highlight the model's ability to identify distinctive features across room categories, demonstrating its effectiveness in indoor scene recognition tasks.
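The class-imbalance problem the abstract describes can be made concrete with a small sketch. The snippet below (an illustration, not the paper's implementation; function and variable names are hypothetical, and the actual LLaVA-captioning and Stable Diffusion generation steps are omitted) computes how many synthetic panoramic images a generative augmentation pipeline such as LSDA would need to produce per room class to equalize the training distribution:

```python
from collections import Counter

def augmentation_targets(labels, target_per_class=None):
    """Compute how many synthetic images to generate per room class.

    labels: list of room-class names for the real training images.
    target_per_class: desired count per class; defaults to the size of
    the largest class, so every class is topped up to match it.
    """
    counts = Counter(labels)
    if target_per_class is None:
        target_per_class = max(counts.values())
    # Deficit per class = number of images the generative pipeline
    # (e.g. a Stable Diffusion stage) would be asked to synthesize.
    return {cls: max(0, target_per_class - n) for cls, n in counts.items()}

# Toy example with three imbalanced room classes.
labels = ["bedroom"] * 50 + ["bathroom"] * 20 + ["hallway"] * 5
print(augmentation_targets(labels))
# {'bedroom': 0, 'bathroom': 30, 'hallway': 45}
```

In practice, each per-class deficit would drive text-to-image generation conditioned on prompts describing that room type, so that minority classes such as "hallway" receive proportionally more synthetic samples.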

Original language: English
Number of pages: 7
Journal: IEEE Transactions on Consumer Electronics
Early online date: 18 Dec 2025
Publication status: E-pub ahead of print - 18 Dec 2025

Keywords

  • convolutional neural network
  • scene recognition
  • transformer
  • vision-and-language navigation
