Kling-Foley: Multimodal Diffusion Transformer for High-Quality Video-to-Audio Generation

Source: arXiv
Abstract

We propose Kling-Foley, a large-scale multimodal video-to-audio generation model that synthesizes high-quality audio synchronized with video content. Kling-Foley uses multimodal diffusion transformers to model the interactions among the video, audio, and text modalities, and combines them with a visual semantic representation module and an audio-visual synchronization module to enhance alignment. Specifically, these modules align video conditions with latent audio elements at the frame level, improving both semantic alignment and audio-visual synchronization. Together with text conditions, this integrated approach enables precise generation of sound effects that match the video. In addition, we propose a universal latent audio codec that achieves high-quality modeling across scenarios such as sound effects, speech, singing, and music, and we employ a stereo rendering method that gives the synthesized audio a sense of spatial presence. To compensate for the incomplete category coverage and annotations of existing open-source benchmarks, we also open-source an industrial-level benchmark, Kling-Audio-Eval. Our experiments show that Kling-Foley, trained with the flow matching objective, achieves new state-of-the-art audio-visual performance among public models in distribution matching, semantic alignment, temporal alignment, and audio quality.

Jun Wang, Xijuan Zeng, Chunyu Qiang, Ruilong Chen, Shiyao Wang, Le Wang, Wangjing Zhou, Pengfei Cai, Jiahui Zhao, Nan Li, Zihan Li, Yuzhe Liang, Xiaopeng Wang, Haorui Zheng, Ming Wen, Kang Yin, Yiran Wang, Nan Li, Feng Deng, Liang Dong, Chen Zhang, Di Zhang, Kun Gai

Electronic Technology Applications

Jun Wang, Xijuan Zeng, Chunyu Qiang, Ruilong Chen, Shiyao Wang, Le Wang, Wangjing Zhou, Pengfei Cai, Jiahui Zhao, Nan Li, Zihan Li, Yuzhe Liang, Xiaopeng Wang, Haorui Zheng, Ming Wen, Kang Yin, Yiran Wang, Nan Li, Feng Deng, Liang Dong, Chen Zhang, Di Zhang, Kun Gai. Kling-Foley: Multimodal Diffusion Transformer for High-Quality Video-to-Audio Generation [EB/OL]. (2025-06-24) [2025-07-16]. https://arxiv.org/abs/2506.19774.
