TY - GEN
T1 - Prosody-enhanced acoustic pre-training and acoustic-disentangled prosody adapting for movie dubbing
AU - Zhang, Zhedong
AU - Li, Liang
AU - Yan, Chenggang
AU - Liu, Chunshan
AU - Van Den Hengel, Anton
AU - Qi, Yuankai
PY - 2025
Y1 - 2025
N2 - Movie dubbing describes the process of transforming a script into speech that aligns temporally and emotionally with a given movie clip while exemplifying the speaker's voice demonstrated in a short reference audio clip. This task demands that the model bridge character performances and complicated prosody structures to build a high-quality video-synchronized dubbing track. The limited scale of movie dubbing datasets, along with the background noise inherent in audio data, hinders the acoustic modeling performance of trained models. To address these issues, we propose an acoustic-prosody disentangled two-stage method to achieve high-quality dubbing generation with precise prosody alignment. First, we propose prosody-enhanced acoustic pre-training to develop robust acoustic modeling capabilities. Then, we freeze the pre-trained acoustic system and design an acoustic-disentangled framework to model prosodic text features and dubbing style while maintaining acoustic quality. Additionally, we incorporate an in-domain emotion analysis module to reduce the impact of visual domain shifts across different movies, thereby enhancing emotion-prosody alignment. Extensive experiments show that our method performs favorably against state-of-the-art models on two primary benchmarks. The project is available at https://zzdoog.github.io/ProDubber/.
AB - Movie dubbing describes the process of transforming a script into speech that aligns temporally and emotionally with a given movie clip while exemplifying the speaker's voice demonstrated in a short reference audio clip. This task demands that the model bridge character performances and complicated prosody structures to build a high-quality video-synchronized dubbing track. The limited scale of movie dubbing datasets, along with the background noise inherent in audio data, hinders the acoustic modeling performance of trained models. To address these issues, we propose an acoustic-prosody disentangled two-stage method to achieve high-quality dubbing generation with precise prosody alignment. First, we propose prosody-enhanced acoustic pre-training to develop robust acoustic modeling capabilities. Then, we freeze the pre-trained acoustic system and design an acoustic-disentangled framework to model prosodic text features and dubbing style while maintaining acoustic quality. Additionally, we incorporate an in-domain emotion analysis module to reduce the impact of visual domain shifts across different movies, thereby enhancing emotion-prosody alignment. Extensive experiments show that our method performs favorably against state-of-the-art models on two primary benchmarks. The project is available at https://zzdoog.github.io/ProDubber/.
UR - http://www.scopus.com/inward/record.url?scp=105017055023&partnerID=8YFLogxK
U2 - 10.1109/CVPR52734.2025.00025
DO - 10.1109/CVPR52734.2025.00025
M3 - Conference proceeding contribution
AN - SCOPUS:105017055023
SN - 9798331543655
SP - 172
EP - 182
BT - 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2025
PB - Institute of Electrical and Electronics Engineers (IEEE)
CY - Piscataway, NJ
T2 - 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2025
Y2 - 11 June 2025 through 15 June 2025
ER -