Noro: Noise-Robust One-shot Voice Conversion with Hidden Speaker Representation Learning
The effectiveness of one-shot voice conversion (VC) decreases in real-world scenarios where the reference speech, often sourced from the internet, contains disturbances such as background noise. To address this issue, we introduce Noro, a noise-robust one-shot VC system. Noro features components tailored for VC with noisy reference speech, including a dual-branch reference encoding module and a noise-agnostic contrastive speaker loss. Experimental results demonstrate that Noro outperforms our baseline system in both clean and noisy scenarios, highlighting its efficacy for real-world applications. Additionally, we investigate the hidden speaker representation capabilities of our baseline system by repurposing its reference encoder as a speaker encoder. The results show that it is competitive with several advanced self-supervised learning models for speaker representation under the SUPERB settings, highlighting the potential for advancing speaker representation learning through one-shot VC tasks.
Gongping Huang, Haoyang Li, Xueyao Zhang, Li Wang, Yuchen Song, Haorui He, Yuancheng Wang, Eng Siong Chng, Zhizheng Wu
Computing technology, computer technology
Gongping Huang, Haoyang Li, Xueyao Zhang, Li Wang, Yuchen Song, Haorui He, Yuancheng Wang, Eng Siong Chng, Zhizheng Wu. Noro: Noise-Robust One-shot Voice Conversion with Hidden Speaker Representation Learning [EB/OL]. (2025-08-28) [2025-09-06]. https://arxiv.org/abs/2411.19770.