Abstract
Video compact representation aims to obtain a representation that reflects the kernel mode of the video content and describes the video concisely. Because most information in complex videos is either noisy or redundant, some researchers have instead focused on long-term video semantics. However, recent video compact representation methods rely heavily on the segmentation accuracy of video semantics. In this paper, we propose a novel framework to address these challenges. Specifically, we design a continuous video semantic embedding model to learn the actual distribution of video words. First, an embedding model based on the continuous bag-of-words method is proposed to learn the video embeddings; it is integrated with a discriminative negative sampling approach, which helps emphasize the convincing clips in the embedding while weakening the influence of the confusing ones. Second, an aggregated distribution pooling method is proposed to capture the semantic distribution of kernel modes in videos. Finally, the trained model generates compact video representations by direct inference, which gives it better generalization ability than previous methods. We performed extensive experiments on event detection and the mining of representative event parts. Experiments on the TRECVID MED11 and CCV datasets demonstrate the effectiveness of our method, which captures the semantic distribution of kernel modes in videos and shows strong potential to discover and better describe complex video patterns.
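For readers unfamiliar with the building block named in the abstract, the following is a minimal sketch of the standard continuous bag-of-words objective with negative sampling, applied to clip ("video word") embeddings. It is not the authors' model: the discriminative sampling strategy and the aggregated distribution pooling are not reproduced here, and the function name, argument names, and array shapes are hypothetical choices made for illustration.

```python
import numpy as np

def cbow_negative_sampling_loss(context_vecs, target_vec, negative_vecs):
    """Standard CBOW loss with negative sampling for one centre clip.

    All names and shapes are illustrative assumptions:
    context_vecs:  (C, D) input embeddings of the surrounding clips
    target_vec:    (D,)   output embedding of the true centre clip
    negative_vecs: (K, D) output embeddings of K sampled negative clips
    """
    # CBOW: the hidden representation is the mean of the context embeddings.
    h = context_vecs.mean(axis=0)

    pos_score = target_vec @ h        # scalar score for the true clip
    neg_scores = negative_vecs @ h    # (K,) scores for the negatives

    # -log(sigmoid(x)) == log(1 + exp(-x)) == logaddexp(0, -x), which is
    # numerically stable for large |x|.
    pos_loss = np.logaddexp(0.0, -pos_score)
    neg_loss = np.logaddexp(0.0, neg_scores).sum()
    return pos_loss + neg_loss
```

In this generic form every negative contributes equally to the loss. A discriminative variant, as the abstract describes at a high level, would presumably change how negatives are chosen or weighted, but those details are not given in this record.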
Original language | English |
---|---|
Article number | 3106 |
Pages (from-to) | 1-14 |
Number of pages | 14 |
Journal | Electronics |
Volume | 10 |
Issue number | 24 |
DOIs | |
Publication status | Published - 2 Dec 2021 |
Externally published | Yes |
Bibliographical note
Copyright the Author(s) 2021. Version archived for private and non-commercial use with the permission of the author/s and according to publisher conditions. For further rights please contact the publisher.
Keywords
- discriminative sampling
- video compact representation
- video embedding