From speaker to dubber: movie dubbing with prosody and duration consistency learning

Zhedong Zhang, Liang Li, Gaoxiang Cong, Haibing Yin, Yuhan Gao, Chenggang Yan, Anton van den Hengel, Yuankai Qi

Research output: Chapter in Book/Report/Conference proceeding › Conference proceeding contribution › peer-review

16 Citations (Scopus)

Abstract

Movie Dubbing aims to convert a script into speech that aligns with the given movie clip in both temporal and emotional aspects, while preserving the vocal timbre of a brief reference audio. The wide variations in emotion, pace, and environment that dubbed speech must exhibit to achieve real alignment make dubbing a complex task. Given the limited scale of movie dubbing datasets (due to copyright) and interference from background noise, learning directly from such datasets limits the pronunciation quality of the learned models. To address this problem, we propose a two-stage dubbing method that allows the model to first learn pronunciation knowledge before practicing it in movie dubbing. In the first stage, we introduce a multi-task approach to pre-train a phoneme encoder on a large-scale text-speech corpus, learning clear and natural phoneme pronunciations. In the second stage, we devise a prosody consistency learning module that bridges emotional expression with phoneme-level dubbing prosody attributes (pitch and energy). Finally, we design a duration consistency reasoning module to align the dubbing duration with the lip movement. Extensive experiments demonstrate that our method outperforms several state-of-the-art methods on two primary benchmarks. The demos are available at https://speaker2dubber.github.io/.
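
The sketch below illustrates the two-stage structure described in the abstract, assuming a PyTorch-style implementation. All module names, inputs (e.g. emotion and lip-movement features), and dimensions are illustrative assumptions, not the authors' released code.

```python
# Hypothetical sketch of the two-stage idea: (1) a phoneme encoder pre-trained on a
# large text-speech corpus, (2) dubbing-specific predictors for phoneme-level prosody
# (pitch, energy) and duration conditioned on visual features. Names are illustrative.
import torch
import torch.nn as nn


class PhonemeEncoder(nn.Module):
    """Stage 1: pre-trained on a text-speech corpus for clear, natural pronunciation."""
    def __init__(self, n_phonemes: int = 100, d_model: int = 256):
        super().__init__()
        self.embed = nn.Embedding(n_phonemes, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)

    def forward(self, phoneme_ids):                      # (B, T) phoneme indices
        return self.encoder(self.embed(phoneme_ids))     # (B, T, d_model)


class ProsodyPredictor(nn.Module):
    """Stage 2: predict phoneme-level pitch/energy from phoneme + emotion features."""
    def __init__(self, d_model: int = 256, d_emo: int = 128):
        super().__init__()
        self.proj = nn.Linear(d_model + d_emo, d_model)
        self.pitch_head = nn.Linear(d_model, 1)
        self.energy_head = nn.Linear(d_model, 1)

    def forward(self, phone_feat, emo_feat):             # (B, T, d_model), (B, T, d_emo)
        h = torch.relu(self.proj(torch.cat([phone_feat, emo_feat], dim=-1)))
        return self.pitch_head(h), self.energy_head(h)   # per-phoneme pitch, energy


class DurationPredictor(nn.Module):
    """Stage 2: predict per-phoneme durations consistent with lip-movement features."""
    def __init__(self, d_model: int = 256, d_lip: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model + d_lip, d_model), nn.ReLU(), nn.Linear(d_model, 1))

    def forward(self, phone_feat, lip_feat):              # (B, T, d_model), (B, T, d_lip)
        return self.net(torch.cat([phone_feat, lip_feat], dim=-1)).squeeze(-1)
```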
Original language: English
Title of host publication: MM '24
Subtitle of host publication: proceedings of the 32nd ACM International Conference on Multimedia
Place of Publication: New York
Publisher: Association for Computing Machinery
Pages: 7523-7532
Number of pages: 10
ISBN (Electronic): 9798400706868
DOIs
Publication status: Published - 2024
Event: ACM International Conference on Multimedia (32nd : 2024) - Melbourne, Australia
Duration: 28 Oct 2024 – 1 Nov 2024
Conference number: 32nd

Conference

Conference: ACM International Conference on Multimedia (32nd : 2024)
Abbreviated title: MM '24
Country/Territory: Australia
City: Melbourne
Period: 28/10/24 – 1/11/24

Keywords

  • Movie dubbing
  • visual voice cloning
  • two-stage framework
