Abstract
Despite significant progress in fully supervised video captioning, zero-shot methods remain much less explored. In this paper, we propose a novel zero-shot video captioning framework named Retrieval-Enhanced Test-Time Adaptation (RETTA), which takes advantage of existing pre-trained large-scale vision and language models to directly generate captions with test-time adaptation. Specifically, we bridge video and text using four key models, chosen for their source-code availability: a general video-text retrieval model (XCLIP), a general image-text matching model (CLIP), a text alignment model (AnglE), and a text generation model (GPT-2). The main challenge is enabling the text generation model to be sufficiently aware of the content of a given video so that it generates corresponding captions. To address this problem, we propose using learnable tokens as a communication medium among the four frozen models. Unlike the conventional approach of training these tokens on training data, we propose to learn them from soft targets of the inference data under several carefully crafted loss functions, which enable the tokens to absorb video information tailored to GPT-2. This adaptation requires only a few iterations (e.g., 16) and no ground-truth data. Extensive experiments on MSR-VTT, MSVD, and VATEX show absolute improvements of 5.1% to 32.4% in CIDEr scores compared to several state-of-the-art zero-shot video captioning methods.
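The general idea of adapting a few learnable tokens at test time against soft targets, while every model stays frozen, can be illustrated with a toy sketch. This is not the RETTA implementation: the abstract does not specify the loss functions, and no real XCLIP, CLIP, AnglE, or GPT-2 model is used here. The names `FROZEN_W`, `frozen_logits`, and `adapt`, the three hypothetical candidate captions, the similarity scores, the temperature, and the learning rate are all assumptions made for illustration. A fixed matrix stands in for a frozen generator, and a list of similarity scores stands in for the output of a frozen retrieval model.

```python
import math


def softmax(xs, temperature=1.0):
    """Numerically stable softmax with an optional temperature."""
    m = max(xs)
    exps = [math.exp((x - m) / temperature) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(target, probs):
    """Cross-entropy between a soft target distribution and predictions."""
    return -sum(t * math.log(p) for t, p in zip(target, probs))


# Hypothetical frozen "generator": a fixed linear map from a 2-dimensional
# learnable token vector to logits over 3 candidate captions. It is never updated.
FROZEN_W = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]


def frozen_logits(tokens):
    return [sum(w * t for w, t in zip(row, tokens)) for row in FROZEN_W]


def adapt(tokens, similarities, steps=16, lr=0.5, temperature=0.5):
    """Update only the learnable tokens for a few steps at test time.

    `similarities` stands in for scores from a frozen retrieval model
    (an assumption); they are converted into a soft target distribution,
    so no ground-truth caption is needed.
    """
    soft_target = softmax(similarities, temperature)
    losses = []
    for _ in range(steps):
        probs = softmax(frozen_logits(tokens))
        losses.append(cross_entropy(soft_target, probs))
        # Gradient of the cross-entropy w.r.t. the logits is (probs - target);
        # the chain rule through the frozen linear map gives the token gradient.
        err = [p - q for p, q in zip(probs, soft_target)]
        grad = [
            sum(FROZEN_W[k][j] * err[k] for k in range(len(err)))
            for j in range(len(tokens))
        ]
        tokens = [t - lr * g for t, g in zip(tokens, grad)]
    return tokens, losses


# Toy similarity scores for three hypothetical candidate captions.
tokens, losses = adapt([0.0, 0.0], [0.9, 0.2, 0.1])
final_probs = softmax(frozen_logits(tokens))
```

In this toy setting the loss falls over the 16 updates and the frozen generator ends up favouring the candidate with the highest similarity score, even though only the token vector was changed.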
| Original language | English |
|---|---|
| Article number | 112170 |
| Pages (from-to) | 1-10 |
| Number of pages | 10 |
| Journal | Pattern Recognition |
| Volume | 171 |
| Issue number | Part A |
| DOIs | |
| Publication status | Published - Mar 2026 |
Bibliographical note
Copyright the Author(s) 2025. Version archived for private and non-commercial use with the permission of the author/s and according to publisher conditions. For further rights please contact the publisher.
Keywords
- Retrieval
- Test-time adaptation
- Video captioning
- Zero-shot