Exploring the Potential of SSL Models for Sound Event Detection

Source: arXiv
Abstract

Self-supervised learning (SSL) models offer powerful representations for sound event detection (SED), yet their synergistic potential remains underexplored. This study systematically evaluates state-of-the-art SSL models to guide optimal model selection and integration for SED. We propose a framework that combines heterogeneous SSL representations (e.g., BEATs, HuBERT, WavLM) through three fusion strategies: individual SSL embedding integration, dual-modal fusion, and full aggregation. Experiments on the DCASE 2023 Task 4 Challenge reveal that dual-modal fusion (e.g., CRNN+BEATs+WavLM) achieves complementary performance gains, while CRNN+BEATs alone delivers the best results among individual SSL models. We further introduce normalized sound event bounding boxes (nSEBBs), an adaptive post-processing method that dynamically adjusts event boundary predictions, improving PSDS1 by up to 4% for standalone SSL models. These findings highlight the compatibility and complementarity of SSL architectures, providing guidance for task-specific fusion and robust SED system design.
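The abstract does not say how the SSL embeddings are combined with the CRNN features. As a rough illustration only, the sketch below assumes one common approach: resample each SSL embedding sequence to the CRNN frame rate, then concatenate feature vectors frame by frame. The function names `resample_frames` and `fuse`, the nearest-neighbour alignment, and the toy dimensions are all hypothetical and are not taken from the paper.

```python
from typing import List

Frames = List[List[float]]  # indexed as [time][feature]


def resample_frames(frames: Frames, target_len: int) -> Frames:
    """Nearest-neighbour resampling along the time axis (assumed alignment step)."""
    if not frames or target_len <= 0:
        raise ValueError("need a non-empty sequence and a positive target length")
    src_len = len(frames)
    return [
        list(frames[min(src_len - 1, (t * src_len) // target_len)])
        for t in range(target_len)
    ]


def fuse(cnn_frames: Frames, ssl_streams: List[Frames]) -> Frames:
    """Concatenate CNN features with every time-aligned SSL stream, per frame.

    One SSL stream mimics individual integration, two mimic dual fusion,
    and all available streams mimic full aggregation (assumed mapping).
    """
    target_len = len(cnn_frames)
    aligned = [resample_frames(stream, target_len) for stream in ssl_streams]
    fused = []
    for t in range(target_len):
        row = list(cnn_frames[t])
        for stream in aligned:
            row.extend(stream[t])
        fused.append(row)
    return fused


# Toy inputs: 4 CNN frames (2 dims), 2 "BEATs" frames (3 dims), 8 "WavLM" frames (1 dim).
cnn = [[0.0, 0.0] for _ in range(4)]
beats = [[1.0, 1.0, 1.0], [2.0, 2.0, 2.0]]
wavlm = [[float(i)] for i in range(8)]

fused = fuse(cnn, [beats, wavlm])
print(len(fused), len(fused[0]))
```

In a system of this kind, the fused frames would typically be passed on to the recurrent layers and the classifier. Whether the authors use concatenation, a learned projection, or some other merge is not stated in the abstract.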

Hanfang Cui, Longfei Song, Li Li, Dongxing Xu, Yanhua Long

Computing technology, computer technology

Hanfang Cui, Longfei Song, Li Li, Dongxing Xu, Yanhua Long. Exploring the Potential of SSL Models for Sound Event Detection [EB/OL]. (2025-05-17) [2025-06-05]. https://arxiv.org/abs/2505.11889.
