TY - JOUR
T1 - Gait-assisted video person retrieval
AU - Zhao, Yang
AU - Wang, Xinlong
AU - Yu, Xiaohan
AU - Liu, Chunlei
AU - Gao, Yongsheng
PY - 2023/2
Y1 - 2023/2
N2 - Video person retrieval aims to match video clips of the same person across non-overlapping camera views, where video sequences contain more comprehensive information, e.g., temporal cues. Extracting useful temporal cues is key to the success of a video person retrieval system. Gait, a unique biometric modality describing the way people walk, carries informative temporal information. To date, it remains unclear how to fully utilize gait to boost the performance of video person retrieval. In this paper, to validate whether gait can help retrieve persons in videos, we build a two-stream architecture, named the appearance-gait network (AGNet), to jointly learn appearance features and gait features from RGB video clips and silhouette video clips. We further explore how to fully utilize gait features to enhance the video feature representation. Specifically, we propose an appearance-gait attention module (AGA) that fuses the appearance and gait features into a discriminative representation for the person retrieval task. Furthermore, to eliminate the requirement for silhouette video clips during inference, we propose a simple yet effective appearance-gait distillation module (AGD) that transfers gait knowledge to the appearance stream. As such, enhanced video person retrieval can be performed without silhouette video clips, which makes inference more flexible and practical. To the best of our knowledge, our work is the first to successfully introduce such an appearance-gait knowledge distillation design for video person retrieval. We verify the effectiveness of the proposed methods on two large-scale, challenging benchmarks, MARS and DukeMTMC-VideoReID. Extensive experiments demonstrate performance superior or comparable to state-of-the-art methods while being much simpler. Source code is publicly available at https://github.com/yangyangkiki/Gait-Assisted-Video-Reid.
AB - Video person retrieval aims to match video clips of the same person across non-overlapping camera views, where video sequences contain more comprehensive information, e.g., temporal cues. Extracting useful temporal cues is key to the success of a video person retrieval system. Gait, a unique biometric modality describing the way people walk, carries informative temporal information. To date, it remains unclear how to fully utilize gait to boost the performance of video person retrieval. In this paper, to validate whether gait can help retrieve persons in videos, we build a two-stream architecture, named the appearance-gait network (AGNet), to jointly learn appearance features and gait features from RGB video clips and silhouette video clips. We further explore how to fully utilize gait features to enhance the video feature representation. Specifically, we propose an appearance-gait attention module (AGA) that fuses the appearance and gait features into a discriminative representation for the person retrieval task. Furthermore, to eliminate the requirement for silhouette video clips during inference, we propose a simple yet effective appearance-gait distillation module (AGD) that transfers gait knowledge to the appearance stream. As such, enhanced video person retrieval can be performed without silhouette video clips, which makes inference more flexible and practical. To the best of our knowledge, our work is the first to successfully introduce such an appearance-gait knowledge distillation design for video person retrieval. We verify the effectiveness of the proposed methods on two large-scale, challenging benchmarks, MARS and DukeMTMC-VideoReID. Extensive experiments demonstrate performance superior or comparable to state-of-the-art methods while being much simpler. Source code is publicly available at https://github.com/yangyangkiki/Gait-Assisted-Video-Reid.
UR - http://www.scopus.com/inward/record.url?scp=85137915936&partnerID=8YFLogxK
U2 - 10.1109/TCSVT.2022.3202531
DO - 10.1109/TCSVT.2022.3202531
M3 - Article
SN - 1051-8215
VL - 33
SP - 897
EP - 908
JO - IEEE Transactions on Circuits and Systems for Video Technology
JF - IEEE Transactions on Circuits and Systems for Video Technology
IS - 2
ER -