TY - GEN
T1 - Weakly supervised video individual counting
AU - Liu, Xinyan
AU - Li, Guorong
AU - Qi, Yuankai
AU - Yan, Ziheng
AU - Han, Zhenjun
AU - van den Hengel, Anton
AU - Yang, Ming-Hsuan
AU - Huang, Qingming
PY - 2024
Y1 - 2024
N2 - Video Individual Counting (VIC) aims to predict the number of unique individuals in a single video. Existing methods learn representations based on trajectory labels for individuals, which are annotation-expensive. To provide a more realistic reflection of the underlying practical challenge, we introduce a weakly supervised VIC task, wherein trajectory labels are not provided. Instead, two types of labels are provided to indicate traffic entering the field of view (inflow) and leaving the field of view (outflow). We also propose the first solution as a baseline that formulates the task as a weakly supervised contrastive learning problem under group-level matching. In doing so, we devise an end-to-end trainable soft contrastive loss to drive the network to distinguish inflow, outflow, and the remaining individuals. To facilitate future study in this direction, we generate annotations from the existing VIC datasets SenseCrowd and CroHD and also build a new dataset, UAVVIC. Extensive results show that our baseline weakly supervised method outperforms supervised methods, and thus little information is lost in the transition to the more practically relevant weakly supervised task. The code and trained model can be found at CGNet.
AB - Video Individual Counting (VIC) aims to predict the number of unique individuals in a single video. Existing methods learn representations based on trajectory labels for individuals, which are annotation-expensive. To provide a more realistic reflection of the underlying practical challenge, we introduce a weakly supervised VIC task, wherein trajectory labels are not provided. Instead, two types of labels are provided to indicate traffic entering the field of view (inflow) and leaving the field of view (outflow). We also propose the first solution as a baseline that formulates the task as a weakly supervised contrastive learning problem under group-level matching. In doing so, we devise an end-to-end trainable soft contrastive loss to drive the network to distinguish inflow, outflow, and the remaining individuals. To facilitate future study in this direction, we generate annotations from the existing VIC datasets SenseCrowd and CroHD and also build a new dataset, UAVVIC. Extensive results show that our baseline weakly supervised method outperforms supervised methods, and thus little information is lost in the transition to the more practically relevant weakly supervised task. The code and trained model can be found at CGNet.
UR - http://www.scopus.com/inward/record.url?scp=85207245124&partnerID=8YFLogxK
U2 - 10.1109/CVPR52733.2024.01819
DO - 10.1109/CVPR52733.2024.01819
M3 - Conference proceeding contribution
AN - SCOPUS:85207245124
SN - 9798350353013
SP - 19228
EP - 19237
BT - 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2024)
PB - Institute of Electrical and Electronics Engineers (IEEE)
CY - Piscataway, NJ
T2 - 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition
Y2 - 16 June 2024 through 22 June 2024
ER -