Tracking with mutual attention network

Tianpeng Liu, Jing Li*, Jia Wu*, Jun Chang, Beihang Song, Bowen Yao

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review


Abstract

Visual tracking is the task of following a specific target given only its location and size in the first frame. To penalize low-quality but high-scoring tracking results, researchers have resorted to foreground reinforcement learning to suppress the scores of positive samples near the edges. However, when training with negative samples, all background regions are labeled equally as false. In this way, the interdependence and the difference between the foreground and the background are not considered. We interpret the underlying cause of drift as an imbalance between the embedded background and foreground information. Specifically, catastrophic tracking results and common tracking errors should not be treated equally; instead, they should strengthen the implicit connection between the foreground and background. In this paper, we propose a Mutual Attention (MA) module to strengthen the interdependence between positive and negative samples. It aggregates the rich contextual interdependence between the target template and the search area, thereby providing an implicit way to update the target template accordingly. To exploit the difference, we design a background training enhancement (BTE) mechanism that distinguishes negative samples with varying degrees of error, down-weighting outrageous and absurd tracking results to improve the robustness of the tracker. Results on a large number of benchmarks, including OTB-100, VOT-2018, VOT-2019, and LaSOT, demonstrate the effectiveness of our approach.
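The abstract does not include the authors' implementation, but the two ideas can be illustrated with a minimal sketch: a mutual (cross) attention block that lets the template and the search-area features refine each other, and a loss re-weighting that down-weights grossly wrong negative samples. All class names, shapes, and the IoU-based weighting below are assumptions for illustration, not the paper's actual code.

```python
# Sketch of the two ideas described in the abstract (hypothetical, not the
# authors' implementation): mutual attention between template and search
# features, and down-weighting of "catastrophic" negative samples.

import torch
import torch.nn as nn
import torch.nn.functional as F


class MutualAttention(nn.Module):
    """Cross-attends template and search features in both directions."""

    def __init__(self, dim: int):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)
        self.scale = dim ** -0.5

    def attend(self, query, context):
        # query: (B, Nq, C), context: (B, Nk, C)
        attn = torch.softmax(
            self.q(query) @ self.k(context).transpose(1, 2) * self.scale, dim=-1
        )
        return query + attn @ self.v(context)  # residual update

    def forward(self, template_feat, search_feat):
        # The template is refined by the search area (an implicit template
        # update), and the search area is refined by the template.
        new_template = self.attend(template_feat, search_feat)
        new_search = self.attend(search_feat, template_feat)
        return new_template, new_search


def weighted_negative_loss(scores, labels, overlap_with_gt, gamma: float = 2.0):
    """Down-weight negatives whose predictions are far from the target.

    `overlap_with_gt` is a stand-in for any measure of how severe a negative
    sample's error is; samples with near-zero overlap ("outrageous" results)
    receive small weights, so common near-miss errors dominate the gradient.
    """
    bce = F.binary_cross_entropy_with_logits(scores, labels, reduction="none")
    neg_mask = labels == 0
    weights = torch.ones_like(bce)
    weights[neg_mask] = overlap_with_gt[neg_mask].clamp(min=1e-3) ** gamma
    return (weights * bce).mean()
```

In this sketch the attention is applied symmetrically, which is one plausible reading of "mutual"; the actual module in the paper may differ in how queries, keys, and values are shared between the two branches.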

Original language: English
Pages (from-to): 5330-5343
Number of pages: 14
Journal: IEEE Transactions on Multimedia
Volume: 25
DOIs
Publication status: Published - 2023
