Investigating self-supervised features for expressive, multilingual voice conversion

Source: arXiv
Abstract

Voice conversion (VC) systems are used in a range of applications, from speaker anonymisation to personalised speech synthesis. Supervised approaches learn a mapping between different speakers using parallel data, which is expensive to produce. Unsupervised approaches are typically trained to reconstruct the input signal, which is composed of the content and the speaker information. Disentangling these components is challenging and often leads to speaker leakage or loss of prosodic information. In this paper, we explore voice conversion by leveraging the potential of self-supervised learning (SSL). A combination of the latent representations of SSL models, concatenated with speaker embeddings, is fed to a vocoder which is trained to reconstruct the input. Zero-shot voice conversion results show that this approach preserves the prosody and content of the source speaker while matching the speaker similarity of a VC system based on phonetic posteriorgrams (PPGs).
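
The conditioning step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the weighted-sum rule for combining SSL layers, the toy dimensions, and the function names `combine_layers` and `build_vocoder_input` are all assumptions made for illustration. Only the general idea comes from the abstract, namely that combined frame-level SSL features are concatenated with a speaker embedding before being passed to a vocoder.

```python
from typing import List

Frame = List[float]


def combine_layers(layers: List[List[Frame]], weights: List[float]) -> List[Frame]:
    """Weighted sum of per-layer SSL features.

    Assumption: the abstract only says "a combination"; a weighted sum
    is used here purely as a stand-in.
    """
    num_frames = len(layers[0])
    dim = len(layers[0][0])
    combined = [[0.0] * dim for _ in range(num_frames)]
    for layer, weight in zip(layers, weights):
        for t in range(num_frames):
            for d in range(dim):
                combined[t][d] += weight * layer[t][d]
    return combined


def build_vocoder_input(content: List[Frame], speaker_embedding: Frame) -> List[Frame]:
    """Append the same utterance-level speaker embedding to every frame."""
    return [frame + speaker_embedding for frame in content]


# Toy data (hypothetical): two SSL layers, three frames, two dimensions each.
layer_a = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
layer_b = [[0.0, 2.0], [1.0, 0.0], [1.0, 2.0]]
target_speaker = [0.5, -0.5, 0.25]  # hypothetical 3-dimensional speaker embedding

content = combine_layers([layer_a, layer_b], weights=[0.5, 0.5])
vocoder_input = build_vocoder_input(content, target_speaker)
print(len(vocoder_input), len(vocoder_input[0]))
```

In a setup like this, training would presumably use the embedding of the speaker who produced the utterance, since the vocoder learns to reconstruct its input. Zero-shot conversion would then swap in the embedding of an unseen target speaker at inference time.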

Álvaro Martín-Cortinas, Daniel Sáez-Trigueros, Grzegorz Beringer, Iván Vallés-Pérez, Roberto Barra-Chicote, Biel Tura-Vecino, Adam Gabryś, Piotr Bilinski, Thomas Merritt, Jaime Lorenzo-Trueba

10.1109/ICASSPW62465.2024.10627128

Computing technology, computer technology

Álvaro Martín-Cortinas, Daniel Sáez-Trigueros, Grzegorz Beringer, Iván Vallés-Pérez, Roberto Barra-Chicote, Biel Tura-Vecino, Adam Gabryś, Piotr Bilinski, Thomas Merritt, Jaime Lorenzo-Trueba. Investigating self-supervised features for expressive, multilingual voice conversion [EB/OL]. (2025-05-13) [2025-06-09]. https://arxiv.org/abs/2505.08278.
