Analysis of ABC Frontend Audio Systems for the NIST-SRE24
We present a comprehensive analysis of the embedding extractors (frontends) developed by the ABC team for the audio track of NIST SRE 2024. We follow the two scenarios imposed by NIST: using only a provided set of telephone recordings for training (fixed condition) or adding publicly available data (open condition). Under these constraints, we develop the best possible speaker embedding extractors for the predominant conversational telephone speech (CTS) domain. We explored architectures based on ResNet with different pooling mechanisms, the recently introduced ReDimNet architecture, as well as a system based on the XLS-R model, which represents the family of large pre-trained self-supervised models. In the open condition, we train on the VoxBlink2 dataset, containing 110 thousand speakers across multiple languages. We observed good performance and robustness of the VoxBlink-trained models, and our experiments provide practical recipes for developing state-of-the-art frontends for speaker recognition.
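To illustrate what such a frontend does, the following is a minimal, hypothetical PyTorch sketch (not the authors' code): frame-level features from an encoder are aggregated with attentive statistics pooling and projected to a fixed-size speaker embedding. The placeholder encoder stands in for the ResNet, ReDimNet, or pre-trained XLS-R models explored in the paper, and all layer sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn


class AttentiveStatsPooling(nn.Module):
    """Weighted mean + standard deviation over the time axis."""

    def __init__(self, feat_dim: int, bottleneck: int = 128):
        super().__init__()
        self.attention = nn.Sequential(
            nn.Conv1d(feat_dim, bottleneck, kernel_size=1),
            nn.Tanh(),
            nn.Conv1d(bottleneck, feat_dim, kernel_size=1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, feat_dim, time)
        w = torch.softmax(self.attention(x), dim=2)        # per-frame attention weights
        mean = torch.sum(x * w, dim=2)                     # weighted mean
        var = torch.sum((x ** 2) * w, dim=2) - mean ** 2   # weighted variance
        std = torch.sqrt(var.clamp(min=1e-6))              # numerical stability
        return torch.cat([mean, std], dim=1)               # (batch, 2 * feat_dim)


class SpeakerEmbeddingFrontend(nn.Module):
    def __init__(self, feat_dim: int = 256, emb_dim: int = 192):
        super().__init__()
        # Placeholder frame-level encoder; a real system would use
        # ResNet, ReDimNet, or an XLS-R backbone instead.
        self.encoder = nn.Conv1d(80, feat_dim, kernel_size=5, padding=2)
        self.pooling = AttentiveStatsPooling(feat_dim)
        self.embedding = nn.Linear(2 * feat_dim, emb_dim)

    def forward(self, fbanks: torch.Tensor) -> torch.Tensor:
        # fbanks: (batch, 80, time) log-Mel filterbank features
        frames = torch.relu(self.encoder(fbanks))
        pooled = self.pooling(frames)
        return self.embedding(pooled)                      # (batch, emb_dim)


if __name__ == "__main__":
    model = SpeakerEmbeddingFrontend()
    dummy = torch.randn(4, 80, 300)                        # 4 utterances, 300 frames each
    print(model(dummy).shape)                              # torch.Size([4, 192])
```

The resulting utterance-level embeddings are what a backend (e.g., cosine scoring or PLDA) would compare for speaker verification.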
Sara Barahona, Anna Silnova, Ladislav Mošner, Junyi Peng, Oldřich Plchot, Johan Rohdin, Lin Zhang, Jiangyu Han, Petr Palka, Federico Landini, Lukáš Burget, Themos Stafylakis, Sandro Cumani, Dominik Boboš, Miroslav Hlaváček, Martin Kodovsky, Tomáš Pavlíček
Sara Barahona, Anna Silnova, Ladislav Mošner, Junyi Peng, Oldřich Plchot, Johan Rohdin, Lin Zhang, Jiangyu Han, Petr Palka, Federico Landini, Lukáš Burget, Themos Stafylakis, Sandro Cumani, Dominik Boboš, Miroslav Hlaváček, Martin Kodovsky, Tomáš Pavlíček. Analysis of ABC Frontend Audio Systems for the NIST-SRE24 [EB/OL]. (2025-05-21) [2025-06-06]. https://arxiv.org/abs/2505.15320