
Investigating self-supervised features for expressive, multilingual voice conversion


Abstract

Voice conversion (VC) systems are widely used for several applications, from speaker anonymisation to personalised speech synthesis. Supervised approaches learn a mapping between different speakers using parallel data, which is expensive to produce. Unsupervised approaches are typically trained to reconstruct the input signal, which is composed of the content and the speaker information. Disentangling these components is a challenge and often leads to speaker leakage or prosodic information removal. In this paper, we explore voice conversion by leveraging the potential of self-supervised learning (SSL). A combination of the latent representations of SSL models, concatenated with speaker embeddings, is fed to a vocoder which is trained to reconstruct the input. Zero-shot voice conversion results show that this approach preserves the prosody and content of the source speaker while matching the speaker similarity of a VC system based on phonetic posteriorgrams (PPGs).
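As a rough illustration of the conditioning step the abstract describes, the sketch below combines several SSL layer outputs into one content vector per frame and appends a speaker embedding to every frame. It is not the authors' implementation: the function name `build_vocoder_input`, the weighted-sum combination of layers, the default uniform `layer_weights`, and all dimensions are assumptions made for this example, since the abstract does not specify how the representations are combined.

```python
from typing import List, Optional, Sequence


def build_vocoder_input(
    ssl_layers: Sequence[Sequence[Sequence[float]]],
    speaker_embedding: Sequence[float],
    layer_weights: Optional[Sequence[float]] = None,
) -> List[List[float]]:
    """Build per-frame vocoder conditioning features (hypothetical helper).

    ssl_layers: one entry per SSL layer, each a list of frames,
        each frame a list of floats (all layers share the same shape).
    speaker_embedding: one utterance-level vector for the speaker.
    layer_weights: assumed mixing weights, uniform when omitted.
    """
    num_layers = len(ssl_layers)
    if layer_weights is None:
        # Assumption: a plain average over layers.
        layer_weights = [1.0 / num_layers] * num_layers

    num_frames = len(ssl_layers[0])
    ssl_dim = len(ssl_layers[0][0])

    frames = []
    for t in range(num_frames):
        # Weighted sum of the layer outputs for frame t.
        content = [
            sum(w * layer[t][d] for w, layer in zip(layer_weights, ssl_layers))
            for d in range(ssl_dim)
        ]
        # Append the same speaker embedding to every frame.
        frames.append(content + list(speaker_embedding))
    return frames
```

Under this sketch, training would pass the source speaker's own embedding so the vocoder learns to reconstruct the input, and zero-shot conversion would swap in a target speaker's embedding at inference time while leaving the SSL features untouched.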

Authors: Álvaro Martín-Cortinas, Daniel Sáez-Trigueros, Grzegorz Beringer, Iván Vallés-Pérez, Roberto Barra-Chicote, Biel Tura-Vecino, Adam Gabryś, Piotr Bilinski, Thomas Merritt, Jaime Lorenzo-Trueba


DOI: 10.1109/ICASSPW62465.2024.10627128

Subject classification: Computing technology, computer technology

Recommended citation: Álvaro Martín-Cortinas, Daniel Sáez-Trigueros, Grzegorz Beringer, Iván Vallés-Pérez, Roberto Barra-Chicote, Biel Tura-Vecino, Adam Gabryś, Piotr Bilinski, Thomas Merritt, Jaime Lorenzo-Trueba. Investigating self-supervised features for expressive, multilingual voice conversion [EB/OL]. (2025-05-13) [2025-06-09]. https://arxiv.org/abs/2505.08278.
