TY - JOUR
T1 - Mix-ViT: Mixing attentive vision transformer for ultra-fine-grained visual categorization
AU - Yu, Xiaohan
AU - Wang, Jun
AU - Zhao, Yang
AU - Gao, Yongsheng
PY - 2023/3
Y1 - 2023/3
AB - Ultra-fine-grained visual categorization (ultra-FGVC) moves down the taxonomy level to classify sub-granularity categories of fine-grained objects. This inevitably poses a challenge, i.e., classifying highly similar objects with limited samples, which impedes the performance of recent advanced vision transformer methods. To this end, this paper introduces Mix-ViT, a novel mixing attentive vision transformer that addresses the above challenge towards improved ultra-FGVC. The core design is a self-supervised module that attentively substitutes high-level tokens across samples and then learns to predict whether each token has been substituted. This drives the model to understand the contextual discriminative details among inter-class samples. By incorporating this self-supervised module, the network gains more knowledge from the intrinsic structure of the input data and thus improves its generalization capability with limited training samples. The proposed Mix-ViT achieves competitive performance on seven publicly available datasets, demonstrating for the first time the potential of vision transformers over CNNs in addressing the challenging ultra-FGVC tasks. The code is available at https://github.com/Markin-Wang/MixViT.
KW - Ultra-fine-grained visual categorization
KW - Vision transformer
KW - Self-supervised learning
KW - Attentive mixing
UR - http://www.scopus.com/inward/record.url?scp=85141303699&partnerID=8YFLogxK
UR - http://purl.org/au-research/grants/arc/DP180100958
UR - http://purl.org/au-research/grants/arc/IH180100002
U2 - 10.1016/j.patcog.2022.109131
DO - 10.1016/j.patcog.2022.109131
M3 - Article
SN - 0031-3203
VL - 135
SP - 1
EP - 10
JO - Pattern Recognition
JF - Pattern Recognition
M1 - 109131
ER -