
FlowDubber: Movie Dubbing with LLM-based Semantic-aware Learning and Flow Matching based Voice Enhancing

Gaoxiang Cong, Liang Li, Jiadong Pan, Zhedong Zhang, Amin Beheshti, Anton van den Hengel, Yuankai Qi, Qingming Huang

Research output: Chapter in Book/Report/Conference proceeding › Conference proceeding contribution › peer-review

Abstract

Movie Dubbing aims to convert scripts into speech that aligns with the given movie clip in both temporal and emotional aspects while preserving the vocal timbre of a given brief reference audio. Existing methods focus primarily on reducing the word error rate while ignoring the importance of lip-sync and acoustic quality. To address these issues, we propose a novel dubbing architecture based on a Large Language Model (LLM) and Conditional Flow Matching (CFM), named FlowDubber, which achieves high-quality audio-visual sync and pronunciation by incorporating a large speech language model with dual contrastive alignment while improving acoustic quality via Flow-based Voice Enhancing (FVE). First, we introduce Qwen2.5 as the backbone of the large speech language model to learn the in-context sequence from movie scripts and reference audio. Second, the proposed semantic-aware learning focuses on capturing LLM semantic knowledge at the phoneme level, which facilitates mutual alignment with lip movement from silent video via Dual Contrastive Alignment (DCA). Third, the FVE introduces an LLM-based acoustics flow matching guidance to strengthen clarity by decoupling Classifier-Free Guidance (CFG) enhancement. Extensive experiments demonstrate that our method outperforms several state-of-the-art methods on two primary benchmarks. The demos are available at https://galaxycong.github.io/LLM-Flow-Dubber/.
Original language: English
Title of host publication: MM '25
Subtitle of host publication: The 33rd ACM International Conference on Multimedia
Place of Publication: New York, NY
Publisher: Association for Computing Machinery (ACM)
Pages: 905-914
Number of pages: 10
ISBN (Electronic): 9798400720352
Publication status: Published - 27 Oct 2025
Event: ACM International Conference on Multimedia (33rd : 2025) - Dublin, Ireland
Duration: 27 Oct 2025 to 31 Oct 2025

Conference

Conference: ACM International Conference on Multimedia (33rd : 2025)
Country/Territory: Ireland
City: Dublin
Period: 27/10/25 to 31/10/25

Keywords

  • Movie Dubbing
  • Visual Voice Cloning
  • Flow Matching
