REF-VC: Robust, Expressive and Fast Zero-Shot Voice Conversion with Diffusion Transformers

Source:

Abstract

In real-world voice conversion applications, environmental noise in source speech and user demands for expressive output pose critical challenges. Traditional ASR-based methods ensure noise robustness but suppress prosodic richness, while SSL-based models improve expressiveness but suffer from timbre leakage and noise sensitivity. This paper proposes REF-VC, a noise-robust, expressive voice conversion system. Key innovations include: (1) a random erasing strategy that mitigates the information redundancy inherent in SSL features, enhancing noise robustness and expressiveness; (2) implicit alignment, inspired by E2TTS, that suppresses reconstruction of non-essential features; (3) integration of Shortcut Models to accelerate flow matching inference, reducing it to 4 steps. Experimental results demonstrate that REF-VC outperforms baselines such as Seed-VC in zero-shot scenarios on the noisy set, while performing comparably to Seed-VC on the clean set. In addition, REF-VC supports singing voice conversion within the same model.

Authors: Zhonghua Fu, Lei Xie, Yuepeng Jiang, Ziqian Ning, Shuai Wang, Chengjia Wang, Mengxiao Bi, Pengcheng Zhu

Affiliations:

Subject classification: Communications

Recommended citation: Zhonghua Fu, Lei Xie, Yuepeng Jiang, Ziqian Ning, Shuai Wang, Chengjia Wang, Mengxiao Bi, Pengcheng Zhu. REF-VC: Robust, Expressive and Fast Zero-Shot Voice Conversion with Diffusion Transformers [EB/OL]. (2025-08-08) [2025-08-18]. https://arxiv.org/abs/2508.04996.
