Dubbing Movies via Hierarchical Phoneme Modeling and Acoustic Diffusion Denoising
Li, Liang1; Cong, Gaoxiang1,2; Qi, Yuankai3; Zha, Zheng-Jun4; Wu, Qi5; Sheng, Quan Z.3; Huang, Qingming6,7; Yang, Ming-Hsuan8
2025-11-01
Journal: IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE
ISSN: 0162-8828
Volume: 47, Issue: 11, Pages: 10361-10377
Abstract: Given a piece of text, a video clip, and reference audio, the movie dubbing task (also known as Visual Voice Cloning, V2C) aims to generate speech that clones the reference voice and aligns well with the video in both emotion and lip movement, which is more challenging than conventional text-to-speech synthesis. To align the generated speech with the inherent lip motion of the given silent video, most existing works use each video frame to query textual phonemes. However, such an attention operation usually leads to mumbled speech: because video frames are finer-grained than phonemes, several frames correspond to a single phoneme, yet attention fuses different phonemes into each of those frames. To address this issue, we propose a diffusion-based movie dubbing architecture, which improves pronunciation through Hierarchical Phoneme Modeling (HPM) and generates better mel-spectrograms through Acoustic Diffusion Denoising (ADD). We term our model HD-Dubber. Specifically, HPM bridges the visual information and the corresponding speech prosody in three ways: (1) aligning lip movement with speech duration at the level of each phoneme unit by contrastive learning; (2) conveying facial expression to phoneme-level energy and pitch; and (3) injecting global emotions captured from video scenes into prosody. ADD, in turn, exploits a denoising diffusion framework to transform a noise signal into a mel-spectrogram via a parameterized Markov chain conditioned on textual phonemes and reference audio. ADD has two novel denoisers, the Style-adaptive Residual Denoiser (SRD) and the Phoneme-enhanced U-net Denoiser (PUD), which enhance speaker similarity and improve pronunciation quality, respectively. Extensive experimental results on three benchmark datasets demonstrate the state-of-the-art performance of the proposed method. The source code and trained models will be made available to the public.
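The "parameterized Markov chain" mentioned in the abstract can be illustrated with the standard DDPM formulation. The sketch below is not the authors' implementation: the linear noise schedule, the step count, the function names, and the oracle denoiser are all assumptions made for illustration. In HD-Dubber the noise estimate would come from the learned SRD and PUD networks conditioned on phonemes and reference audio. Here a hypothetical oracle that already knows the target mel-spectrogram stands in, so that the sampling arithmetic can be checked.

```python
import numpy as np

def make_schedule(num_steps=50, beta_start=1e-4, beta_end=0.05):
    """Linear noise schedule (an assumed choice, not taken from the paper)."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    return betas, alphas, alpha_bars

def forward_noise(x0, t, alpha_bars, rng):
    """Closed-form forward step: x_t = sqrt(abar_t) x_0 + sqrt(1 - abar_t) eps."""
    eps = rng.standard_normal(x0.shape)
    xt = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps
    return xt, eps

def make_oracle_denoiser(alpha_bars):
    """Hypothetical stand-in for a learned denoiser.

    It treats the conditioning input as the true clean mel-spectrogram and
    returns the exact noise implied by x_t. A real model would predict this
    noise from phoneme and speaker-style features instead.
    """
    def denoiser(xt, t, cond):
        return (xt - np.sqrt(alpha_bars[t]) * cond) / np.sqrt(1.0 - alpha_bars[t])
    return denoiser

def reverse_sample(shape, cond, denoiser, schedule, rng):
    """Ancestral sampling: start from Gaussian noise and denoise step by step."""
    betas, alphas, alpha_bars = schedule
    x = rng.standard_normal(shape)
    for t in reversed(range(len(betas))):
        eps_hat = denoiser(x, t, cond)
        # Posterior mean of x_{t-1} given x_t and the predicted noise.
        mean = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps_hat) / np.sqrt(alphas[t])
        if t > 0:
            x = mean + np.sqrt(betas[t]) * rng.standard_normal(shape)
        else:
            x = mean  # no noise is added at the final step
    return x

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    schedule = make_schedule()
    target_mel = rng.standard_normal((80, 12))  # toy array: 80 mel bins x 12 frames
    denoiser = make_oracle_denoiser(schedule[2])
    mel = reverse_sample(target_mel.shape, target_mel, denoiser, schedule, rng)
    print(mel.shape, np.allclose(mel, target_mel))
```

With the oracle in place, the chain ends at the target array, which only confirms that the update rule is wired correctly. It says nothing about the quality a trained denoiser would reach.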
Keywords: Videos; Lips; Visualization; Acoustics; Cloning; Noise reduction; Motion pictures; Head; Adaptation models; Text to speech; Visual voice cloning; speech synthesis; hierarchical phoneme modeling; contrastive learning; acoustic diffusion denoising
DOI: 10.1109/TPAMI.2025.3597267
Indexed by: SCI
Language: English
Funding: National Natural Science Foundation of China[62322211] ; National Natural Science Foundation of China[62236008] ; Pioneer and Leading Goose R&D Program of Zhejiang Province[2024C01023] ; Key Laboratory of Intelligent Processing Technology for Digital Music (Zhejiang Conservatory of Music), Ministry of Culture and Tourism[2023DMKLB004]
WOS Research Area: Computer Science ; Engineering
WOS Category: Computer Science, Artificial Intelligence ; Engineering, Electrical & Electronic
WOS Accession Number: WOS:001587252700038
Publisher: IEEE COMPUTER SOC
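Document Type: Journal article
Identifier: http://119.78.100.204/handle/2XEOYT63/41625
Collection: Institute of Computing Technology, Chinese Academy of Sciences, Journal Papers (English)
Corresponding Authors: Cong, Gaoxiang; Qi, Yuankai
Affiliations:
1.Chinese Acad Sci, Inst Comp Technol, Key Lab Intelligent Informat Proc, Beijing 100190, Peoples R China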
2.Univ Chinese Acad Sci, Beijing 101408, Peoples R China
3.Macquarie Univ, Sch Comp, Sydney, NSW 2113, Australia
4.Univ Sci & Technol China, Hefei 230052, Peoples R China
5.Univ Adelaide, Australian Inst Machine Learning, Sch Comp Sci, Adelaide, SA 5005, Australia
6.Univ Chinese Acad Sci, Sch Comp Sci & Technol, Beijing 101408, Peoples R China
7.Chinese Acad Sci, Inst Comp Technol, Beijing 100190, Peoples R China
8.Univ Calif Merced, EECS, Merced, CA 95344 USA
Recommended Citation
GB/T 7714
Li, Liang,Cong, Gaoxiang,Qi, Yuankai,et al. Dubbing Movies via Hierarchical Phoneme Modeling and Acoustic Diffusion Denoising[J]. IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE,2025,47(11):10361-10377.
APA Li, Liang.,Cong, Gaoxiang.,Qi, Yuankai.,Zha, Zheng-Jun.,Wu, Qi.,...&Yang, Ming-Hsuan.(2025).Dubbing Movies via Hierarchical Phoneme Modeling and Acoustic Diffusion Denoising.IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE,47(11),10361-10377.
MLA Li, Liang,et al."Dubbing Movies via Hierarchical Phoneme Modeling and Acoustic Diffusion Denoising".IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE 47.11(2025):10361-10377.