TS-SUPERB: A Target Speech Processing Benchmark for Speech Self-Supervised Learning Models
Self-supervised learning (SSL) models have significantly advanced speech processing tasks, and several benchmarks have been proposed to validate their effectiveness. However, previous benchmarks have primarily focused on single-speaker scenarios, with less exploration of target-speaker tasks in noisy, multi-talker conditions, which are more challenging yet more practical. In this paper, we introduce the Target-Speaker Speech Processing Universal Performance Benchmark (TS-SUPERB), which includes four widely recognized target-speaker processing tasks that require identifying the target speaker and extracting information from the speech mixture. In our benchmark, the speaker embedding extracted from enrollment speech is used as a clue to condition downstream models. The benchmark results reveal the importance of evaluating SSL models in target-speaker scenarios, demonstrating that performance cannot be easily inferred from related single-speaker tasks. Moreover, by using a unified SSL-based target speech encoder, consisting of a speaker encoder and an extractor module, we also investigate joint optimization across target-speaker (TS) tasks to leverage mutual information and demonstrate its effectiveness.
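The conditioning idea in the abstract can be illustrated with a minimal sketch. The abstract does not say how the speaker embedding is pooled or how it is fused with the mixture, so the mean pooling and element-wise multiplication below are assumptions chosen for simplicity, and the function names `speaker_embedding` and `condition_on_speaker` are hypothetical. Plain lists of floats stand in for frame-level SSL features.

```python
from typing import List

Frame = List[float]  # one feature vector per time frame (stand-in for SSL features)


def speaker_embedding(enrollment_feats: List[Frame]) -> Frame:
    """Mean-pool enrollment frames into one vector (assumed pooling, not from the paper)."""
    n = len(enrollment_feats)
    dim = len(enrollment_feats[0])
    return [sum(frame[d] for frame in enrollment_feats) / n for d in range(dim)]


def condition_on_speaker(mixture_feats: List[Frame], spk_emb: Frame) -> List[Frame]:
    """Scale every mixture frame element-wise by the embedding (assumed fusion)."""
    return [[x * e for x, e in zip(frame, spk_emb)] for frame in mixture_feats]


# Toy data: 2 enrollment frames and 3 mixture frames, feature dimension 2.
enrollment = [[1.0, 0.0], [3.0, 2.0]]
mixture = [[2.0, 4.0], [0.5, 1.0], [1.0, 1.0]]

emb = speaker_embedding(enrollment)
conditioned = condition_on_speaker(mixture, emb)
```

In this sketch the conditioned frames would then be passed to a task-specific downstream model; other fusion choices, such as addition or concatenation, would fit the same interface.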
Junyi Peng, Takanori Ashihara, Marc Delcroix, Tsubasa Ochiai, Oldrich Plchot, Shoko Araki, Jan Černocký
Subject: Computing technology, computer technology
Junyi Peng, Takanori Ashihara, Marc Delcroix, Tsubasa Ochiai, Oldrich Plchot, Shoko Araki, Jan Černocký. TS-SUPERB: A Target Speech Processing Benchmark for Speech Self-Supervised Learning Models [EB/OL]. (2025-05-10) [2025-06-25]. https://arxiv.org/abs/2505.06660.