V2C: Visual Voice Cloning

Qi Chen, Mingkui Tan, Yuankai Qi, Jiaqiu Zhou, Yuanqing Li*, Qi Wu*

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding > Conference proceeding contribution > peer-review

7 Citations (Scopus)


Existing Voice Cloning (VC) tasks aim to convert a paragraph of text to speech with a desired voice specified by a reference audio. This has significantly boosted the development of artificial speech applications. However, there are many scenarios that these VC tasks do not reflect well, such as movie dubbing, which requires the speech to carry emotions consistent with the movie plot. To fill this gap, in this work we propose a new task named Visual Voice Cloning (V2C), which seeks to convert a paragraph of text to speech with both a desired voice specified by a reference audio and a desired emotion specified by a reference video. To facilitate research in this field, we construct a dataset, V2C-Animation, and propose a strong baseline based on existing state-of-the-art (SoTA) VC techniques. Our dataset contains 10,217 animated movie clips covering a large variety of genres (e.g., Comedy, Fantasy) and emotions (e.g., happy, sad). We further design a set of evaluation metrics, named MCD-DTW-SL, which helps evaluate the similarity between ground-truth speech and synthesised speech. Extensive experimental results show that even SoTA VC methods cannot generate satisfactory speech for our V2C task. We hope the proposed new task, together with the constructed dataset and evaluation metric, will facilitate research in the field of voice cloning and in the broader vision-and-language community. Source code and dataset will be released at https://github.com/chenqi008/V2C.

Original language: English
Title of host publication: 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition CVPR 2022
Subtitle of host publication: proceedings
Place of Publication: Piscataway, NJ
Publisher: Institute of Electrical and Electronics Engineers (IEEE)
Number of pages: 10
ISBN (Electronic): 9781665469463
ISBN (Print): 9781665469470
Publication status: Published - 2022
Externally published: Yes
Event: 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022 - New Orleans, United States
Duration: 19 Jun 2022 to 24 Jun 2022

Publication series

ISSN (Print): 1063-6919
ISSN (Electronic): 2575-7075


Conference: 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022
Country/Territory: United States
City: New Orleans

