
Local self-attention in transformer for visual question answering

Xiang Shen, Dezhi Han*, Zihan Guo, Chongqing Chen, Jie Hua, Gaofeng Luo

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

Abstract

Visual Question Answering (VQA) is a multimodal task that requires models to understand both textual and visual information. Many VQA models adopt the Transformer structure because its self-attention mechanism models global dependencies well. However, balancing global and local dependency modeling in traditional Transformer structures remains an open problem: a Transformer-based VQA model that only models global dependencies cannot effectively capture image context information. To address this issue, this paper proposes a novel Local Self-Attention in Transformer (LSAT) model for visual question answering. The LSAT model sets local windows over the visual features and models intra-window and inter-window attention simultaneously. As a result, it avoids the redundant information in global self-attention while still capturing rich contextual information. This paper uses grid visual features to conduct extensive experiments and ablation studies on the VQA benchmark datasets VQA 2.0 and CLEVR. The experimental results show that the LSAT model outperforms the benchmark model on all metrics when an appropriate local window size is selected. Specifically, the best test results of LSAT using grid visual features on the VQA 2.0 and CLEVR datasets were 71.94% and 98.72%, respectively. The ablation studies further support the effectiveness of the proposed method. Source code is available at https://github.com/shenxiang-vqa/LSAT.
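To make the idea of local windows over grid features concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes a grid of shape (H, W, D) whose height and width are divisible by the window size, and it uses the features directly as queries, keys, and values with no learned projections. The abstract does not say how inter-window attention is computed, so the step shown here (mean-pooling each window into a summary token and attending across those summaries) is purely an illustrative assumption, and all function names are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # Scaled dot-product attention; works on (..., tokens, dim) arrays.
    d = q.shape[-1]
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(d)
    return softmax(scores, axis=-1) @ v

def window_partition(grid, w):
    # (H, W, D) -> (num_windows, w*w, D); assumes H and W are divisible by w.
    H, W, D = grid.shape
    x = grid.reshape(H // w, w, W // w, w, D)
    x = x.transpose(0, 2, 1, 3, 4)
    return x.reshape(-1, w * w, D)

def window_merge(windows, H, W, w):
    # Inverse of window_partition: (num_windows, w*w, D) -> (H, W, D).
    D = windows.shape[-1]
    x = windows.reshape(H // w, W // w, w, w, D)
    x = x.transpose(0, 2, 1, 3, 4)
    return x.reshape(H, W, D)

def local_self_attention(grid, w):
    H, W, _ = grid.shape
    windows = window_partition(grid, w)

    # Intra-window attention: each token attends only to tokens in its window.
    intra = attention(windows, windows, windows)

    # ASSUMPTION (not from the paper): inter-window attention over
    # mean-pooled window summaries, broadcast back to every token.
    summaries = intra.mean(axis=1)
    inter = attention(summaries, summaries, summaries)

    return window_merge(intra + inter[:, None, :], H, W, w)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    grid = rng.standard_normal((8, 8, 16))   # hypothetical 8x8 grid, 16-dim features
    out = local_self_attention(grid, w=4)
    print(out.shape)
```

With a window size of 4 on an 8×8 grid, each attention map in the intra-window step covers 16 tokens rather than all 64, which is the kind of reduction in redundant global interactions the abstract refers to.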

Original language: English
Pages (from-to): 16706-16723
Number of pages: 18
Journal: Applied Intelligence
Volume: 53
Issue number: 13
Publication status: Published - Jul 2023
Externally published: Yes

Keywords

  • Transformer
  • Local self-attention
  • Grid/regional visual features
  • Visual question answering
