Abstract
Indoor scene recognition is a crucial component of vision-and-language navigation (VLN), which involves guiding an agent through unseen, photo-realistic environments using natural language instructions. However, existing research has not yet endowed VLN agents with robust indoor scene recognition capabilities, which are essential for enhancing an agent's spatial perception and environmental understanding. In this paper, we introduce the Room-to-Room Scene Recognition (R2R-SR) dataset, designed to enhance the spatial perception capabilities of agents in VLN. The R2R-SR dataset comprises 7,283 images across 10 room classes, sourced from the Matterport3D (MP3D) simulator. To alleviate the imbalance in training data distribution across room categories, we propose a panoramic environment generation technique using Llava and Stable Diffusion for data Augmentation (LSDA). Our proposed LSDA-based model achieves a Top-1 accuracy of 75.9%, outperforming the best competitors in the convolutional neural network (CNN) and vision transformer (ViT) domains. Qualitative results highlight the model's proficiency in identifying distinct features across room categories, demonstrating its effectiveness in indoor scene recognition tasks.
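The Top-1 accuracy reported above is the standard metric for multi-class scene classification: a prediction counts as correct only when the highest-scoring class matches the ground-truth room label. A minimal sketch of this evaluation is shown below; all names (`logits`, `labels`, `top1_accuracy`) are illustrative and not taken from the paper's code.

```python
# Hedged sketch: computing Top-1 accuracy for a multi-class scene
# classifier (e.g. the 10 room classes in R2R-SR). Function and
# variable names are hypothetical, not the authors' implementation.

def top1_accuracy(logits, labels):
    """logits: per-sample lists of class scores; labels: ground-truth ids."""
    correct = 0
    for scores, label in zip(logits, labels):
        # The Top-1 prediction is the class with the highest score.
        pred = max(range(len(scores)), key=lambda i: scores[i])
        if pred == label:
            correct += 1
    return correct / len(labels)

# Toy example: 4 samples over 3 classes, 3 predictions correct.
logits = [[0.1, 0.7, 0.2], [0.9, 0.05, 0.05], [0.3, 0.3, 0.4], [0.2, 0.5, 0.3]]
labels = [1, 0, 2, 0]
print(top1_accuracy(logits, labels))  # -> 0.75
```

A reported 75.9% Top-1 accuracy corresponds to this ratio computed over the full test split.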
| Original language | English |
|---|---|
| Number of pages | 7 |
| Journal | IEEE Transactions on Consumer Electronics |
| Early online date | 18 Dec 2025 |
| DOIs | |
| Publication status | E-pub ahead of print - 18 Dec 2025 |
Keywords
- convolutional neural network
- scene recognition
- transformer
- vision-and-language navigation