Mix-ViT: mixing attentive vision transformer for ultra-fine-grained visual categorization

Xiaohan Yu, Jun Wang, Yang Zhao, Yongsheng Gao*

*Corresponding author for this work

Research output: Contribution to journal, Article, peer-review

19 Citations (Scopus)


Ultra-fine-grained visual categorization (ultra-FGVC) moves down the taxonomy level to classify sub-granularity categories of fine-grained objects. This poses the challenge of classifying highly similar objects from limited samples, which impedes the performance of recent advanced vision transformer methods. To address this challenge, this paper introduces Mix-ViT, a novel mixing attentive vision transformer for improved ultra-FGVC. The core design is a self-supervised module that mixes high-level sample tokens by attentively substituting them, and then learns to predict whether each token has been substituted. This drives the model to understand the contextual discriminative details across samples of different classes. By incorporating this self-supervised module, the network gains more knowledge from the intrinsic structure of the input data and thus generalizes better with limited training samples. The proposed Mix-ViT achieves competitive performance on seven publicly available datasets, demonstrating for the first time the potential of vision transformers, compared with CNNs, in addressing challenging ultra-FGVC tasks. The code is available at https://github.com/Markin-Wang/MixViT
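The token-mixing idea described in the abstract can be illustrated with a small sketch. This is not the authors' implementation (see the linked repository for that); it is a minimal, dependency-free illustration under stated assumptions. The function name `attentive_mix`, the rule of choosing the `k` most-attended positions of the second sample, and the use of plain Python lists in place of token embeddings are all hypothetical choices made for the example. The abstract does not specify how positions are selected.

```python
def attentive_mix(tokens_a, tokens_b, attn_b, k):
    """Replace k tokens of sample A with tokens from sample B.

    Assumption (not from the paper): the substituted positions are the k
    positions where sample B has the highest attention score.

    Returns the mixed token list and one 0/1 label per position
    (1 = substituted), which a token-level classifier could be trained
    to predict as a self-supervised target.
    """
    if not (len(tokens_a) == len(tokens_b) == len(attn_b)):
        raise ValueError("inputs must have the same number of tokens")

    # Indices of the k highest attention scores in sample B.
    top = sorted(range(len(attn_b)), key=lambda i: attn_b[i], reverse=True)[:k]

    mixed = list(tokens_a)          # copy so the input is not modified
    labels = [0] * len(tokens_a)    # 0 = original token
    for i in top:
        mixed[i] = tokens_b[i]      # substitute with B's token at position i
        labels[i] = 1               # 1 = substituted token
    return mixed, labels


# Toy example: strings stand in for high-level token embeddings.
tokens_a = ["a0", "a1", "a2", "a3"]
tokens_b = ["b0", "b1", "b2", "b3"]
attn_b = [0.1, 0.5, 0.05, 0.35]

mixed, labels = attentive_mix(tokens_a, tokens_b, attn_b, k=2)
print(mixed)   # ['a0', 'b1', 'a2', 'b3']
print(labels)  # [0, 1, 0, 1]
```

In a real model the tokens would be embedding vectors from a late transformer layer, and the labels would supervise an auxiliary head trained alongside the classification loss.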

Original language: English
Article number: 109131
Pages (from-to): 1-10
Number of pages: 10
Journal: Pattern Recognition
Publication status: Published - Mar 2023
Externally published: Yes


  • Ultra-fine-grained visual categorization
  • Vision transformer
  • Self-supervised learning
  • Attentive mixing

