RETTA: Retrieval-enhanced test-time adaptation for zero-shot video captioning

Yunchuan Ma, Laiyun Qing*, Guorong Li, Yuankai Qi, Amin Beheshti, Quan Z. Sheng, Qingming Huang

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review


Abstract

Despite the significant progress of fully-supervised video captioning, zero-shot methods remain much less explored. In this paper, we propose a novel zero-shot video captioning framework named Retrieval-Enhanced Test-Time Adaptation (RETTA), which takes advantage of existing pre-trained large-scale vision and language models to directly generate captions with test-time adaptation. Specifically, we bridge video and text using four key models, chosen for their source-code availability: a general video-text retrieval model XCLIP, a general image-text matching model CLIP, a text alignment model AnglE, and a text generation model GPT-2. The main challenge is how to make the text generation model sufficiently aware of the content of a given video so that it generates corresponding captions. To address this problem, we propose using learnable tokens as a communication medium among these four frozen models (GPT-2, XCLIP, CLIP, and AnglE). Different from the conventional approach of training these tokens with training data, we propose to learn them with soft targets of the inference data under several carefully crafted loss functions, which enable the tokens to absorb video information tailored to GPT-2. This adaptation requires only a few iterations (e.g., 16) and does not require ground truth data. Extensive experiments on MSR-VTT, MSVD, and VATEX show absolute improvements of 5.1% to 32.4% in CIDEr scores compared to several state-of-the-art zero-shot video captioning methods.
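
The core idea of the abstract (learnable tokens adapted at test time against soft targets while every model stays frozen) can be illustrated with a deliberately tiny, pure-Python sketch. It is not the authors' implementation: `frozen_W` is a hypothetical linear stand-in for a frozen generator, `video_text_scores` are made-up similarity scores of the kind a frozen retrieval model might produce, and a single cross-entropy loss replaces the several loss functions the abstract mentions. Only the 16-step budget is taken from the text.

```python
import math


def softmax(xs, temperature=1.0):
    """Numerically stable softmax over a list of scores."""
    scaled = [x / temperature for x in xs]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def cross_entropy(target, pred):
    """Cross-entropy between a soft target and a predicted distribution."""
    return -sum(t * math.log(p) for t, p in zip(target, pred))


def adapt_token(frozen_W, soft_target, dim, steps=16, lr=0.5):
    """Gradient descent on the token only; frozen_W is never updated."""
    token = [0.0] * dim
    losses = []
    for _ in range(steps):
        pred = softmax(matvec(frozen_W, token))
        losses.append(cross_entropy(soft_target, pred))
        # Gradient of the loss w.r.t. the logits is (pred - target);
        # the chain rule through the frozen linear map gives W^T (pred - target).
        diff = [p - t for p, t in zip(pred, soft_target)]
        grad = [
            sum(frozen_W[i][j] * diff[i] for i in range(len(diff)))
            for j in range(dim)
        ]
        token = [t - lr * g for t, g in zip(token, grad)]
    # Record the loss after the final update as well.
    losses.append(cross_entropy(soft_target, softmax(matvec(frozen_W, token))))
    return token, losses


# Hypothetical similarity scores between one test video and three candidate texts.
video_text_scores = [2.0, 0.5, -1.0]
# Soft target built from the test sample itself; no ground-truth caption is used.
soft_target = softmax(video_text_scores, temperature=0.5)
# Hypothetical frozen map from a 2-d token to logits over the three candidates.
frozen_W = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]

token, losses = adapt_token(frozen_W, soft_target, dim=2)
print(f"loss before adaptation: {losses[0]:.4f}, after 16 steps: {losses[-1]:.4f}")
```

In this toy setting the token is the only quantity that changes, and the loss falls as the frozen map's output moves toward the soft target, which mirrors the role the abstract assigns to the learnable tokens.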

Original language: English
Article number: 112170
Pages (from-to): 1-10
Number of pages: 10
Journal: Pattern Recognition
Volume: 171
Issue number: Part A
Publication status: Published - Mar 2026

Bibliographical note

Copyright the Author(s) 2025. Version archived for private and non-commercial use with the permission of the author/s and according to publisher conditions. For further rights please contact the publisher.

Keywords

  • Retrieval
  • Test-time adaptation
  • Video captioning
  • Zero-shot
