National Preprint Platform

EzAudio: Enhancing Text-to-Audio Generation with Efficient Diffusion Transformer

Source: arXiv
English Abstract

We introduce EzAudio, a text-to-audio (T2A) generation framework designed to produce high-quality, natural-sounding sound effects. Core designs include: (1) We propose EzAudio-DiT, an optimized Diffusion Transformer (DiT) designed for audio latent representations, improving convergence speed as well as parameter and memory efficiency. (2) We apply a classifier-free guidance (CFG) rescaling technique to mitigate fidelity loss at higher CFG scales and enhance prompt adherence without compromising audio quality. (3) We propose a synthetic caption generation strategy leveraging recent advances in audio understanding and LLMs to enhance T2A pretraining. We show that EzAudio, with its computationally efficient architecture and fast convergence, is a competitive open-source model that excels in both objective and subjective evaluations by delivering highly realistic listening experiences. Code, data, and pre-trained models are released at: https://haidog-yaqub.github.io/EzAudio-Page/.
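The CFG rescaling mentioned in point (2) is commonly implemented by shrinking the guided prediction so its standard deviation matches that of the conditional prediction, then blending with the unrescaled output. A minimal NumPy sketch of this general recipe (function and parameter names are illustrative, not taken from the EzAudio codebase):

```python
import numpy as np

def cfg_with_rescale(cond, uncond, guidance_scale=5.0, rescale=0.7):
    """Classifier-free guidance with std rescaling (illustrative sketch)."""
    # Standard CFG: push the prediction away from the unconditional branch.
    cfg = uncond + guidance_scale * (cond - uncond)
    # High guidance scales inflate the output's standard deviation,
    # which can hurt fidelity; rescale toward the conditional std.
    rescaled = cfg * (cond.std() / cfg.std())
    # Blend the rescaled and plain CFG outputs with factor `rescale`.
    return rescale * rescaled + (1.0 - rescale) * cfg

# Toy latents standing in for the model's conditional/unconditional outputs.
rng = np.random.default_rng(0)
cond = rng.normal(size=(1, 8, 128))
uncond = rng.normal(size=(1, 8, 128))
out = cfg_with_rescale(cond, uncond)
```

With `rescale=0.7` the guided sample's dynamic range stays much closer to the conditional prediction's than plain CFG at the same guidance scale, which is the fidelity-preserving effect the abstract refers to.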

Jiarui Hai, Yong Xu, Hao Zhang, Chenxing Li, Helin Wang, Mounya Elhilali, Dong Yu

Computing Technology; Computer Technology

Jiarui Hai, Yong Xu, Hao Zhang, Chenxing Li, Helin Wang, Mounya Elhilali, Dong Yu. EzAudio: Enhancing Text-to-Audio Generation with Efficient Diffusion Transformer [EB/OL]. (2025-06-19) [2025-07-16]. https://arxiv.org/abs/2409.10819.
