Prosody-enhanced acoustic pre-training and acoustic-disentangled prosody adapting for movie dubbing

Zhedong Zhang, Liang Li*, Chenggang Yan, Chunshan Liu, Anton Van Den Hengel, Yuankai Qi

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding › Conference proceeding contribution › peer-review

Abstract

Movie dubbing describes the process of transforming a script into speech that aligns temporally and emotionally with a given movie clip while exemplifying the speaker's voice demonstrated in a short reference audio clip. This task demands the model bridge character performances and complicated prosody structures to build a high-quality video-synchronized dubbing track. The limited scale of movie dubbing datasets, along with the background noise inherent in audio data, hinders the acoustic modeling performance of trained models. To address these issues, we propose an acoustic-prosody disentangled two-stage method to achieve high-quality dubbing generation with precise prosody alignment. First, we propose a prosody-enhanced acoustic pre-training to develop robust acoustic modeling capabilities. Then, we freeze the pre-trained acoustic system and design an acoustic-disentangled framework to model prosodic text features and dubbing style while maintaining acoustic quality. Additionally, we incorporate an in-domain emotion analysis module to reduce the impact of visual domain shifts across different movies, thereby enhancing emotion-prosody alignment. Extensive experiments show that our method performs favorably against the state-of-the-art models on two primary benchmarks. The project is available at https://zzdoog.github.io/ProDubber/.

Original language: English
Title of host publication: 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition CVPR 2025
Subtitle of host publication: proceedings
Place of Publication: Piscataway, NJ
Publisher: Institute of Electrical and Electronics Engineers (IEEE)
Pages: 172-182
Number of pages: 11
ISBN (Electronic): 9798331543648
ISBN (Print): 9798331543655
DOIs
Publication status: Published - 2025
Event: 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2025 - Nashville, United States
Duration: 11 Jun 2025 to 15 Jun 2025

Publication series

Name
ISSN (Print): 1063-6919
ISSN (Electronic): 2575-7075

Conference

Conference: 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2025
Country/Territory: United States
City: Nashville
Period: 11/06/25 to 15/06/25
