TY - JOUR
T1 - Tracking with mutual attention network
AU - Liu, Tianpeng
AU - Li, Jing
AU - Wu, Jia
AU - Chang, Jun
AU - Song, Beihang
AU - Yao, Bowen
PY - 2023
Y1 - 2023
AB - Visual tracking is the task of following a specific target given only its location and size in the first frame. To penalize low-quality but high-scoring tracking results, researchers have resorted to foreground reinforcement learning to suppress the scores of positive samples near the edges. However, when training with negative samples, all background regions are equally labeled as false, so the interdependence and the difference between the foreground and the background are not considered. We interpret the underlying cause of drift as the imbalance between the embedding of background and foreground information. Specifically, catastrophic tracking results and common tracking errors should not be treated equally; instead, the implicit connection between the foreground and the background should be strengthened. In this paper, we propose a Mutual Attention (MA) module to strengthen the interdependence between positive and negative samples. It aggregates the rich contextual interdependence between the target template and the search area, thereby providing an implicit way to update the target template. As for the difference, we design a background training enhancement (BTE) mechanism that distinguishes negative samples with varying degrees of error, down-weighting grossly erroneous tracking results to improve the robustness of the tracker. Results on a large number of benchmarks, including OTB-100, VOT-2018, VOT-2019, and LaSOT, demonstrate the effectiveness of our method.
UR - http://www.scopus.com/inward/record.url?scp=85135248179&partnerID=8YFLogxK
U2 - 10.1109/TMM.2022.3190679
DO - 10.1109/TMM.2022.3190679
M3 - Article
AN - SCOPUS:85135248179
SN - 1520-9210
VL - 25
SP - 5330
EP - 5343
JO - IEEE Transactions on Multimedia
JF - IEEE Transactions on Multimedia
ER -