CST-former: Multidimensional Attention-based Transformer for Sound Event Localization and Detection in Real Scenes

Source: arXiv
Abstract

Sound event localization and detection (SELD) is a task for the classification of sound events and the identification of direction of arrival (DoA) utilizing multichannel acoustic signals. For effective classification and localization, a channel-spectro-temporal transformer (CST-former) was suggested. CST-former employs multidimensional attention mechanisms across the spatial, spectral, and temporal domains to enlarge the model's capacity to learn the domain information essential for event detection and DoA estimation over time. In this work, we present an enhanced version of CST-former with multiscale unfolded local embedding (MSULE) developed to capture and aggregate domain information over multiple time-frequency scales. Also, we propose finetuning and post-processing techniques beneficial for conducting the SELD task over limited training datasets. In-depth ablation studies of the proposed architecture and detailed analysis on the proposed modules are carried out to validate the efficacy of multidimensional attentions on the SELD task. Empirical validation through experimentation on STARSS22 and STARSS23 datasets demonstrates the remarkable performance of CST-former and post-processing techniques without using external data.
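The idea of attending separately along the spatial (channel), spectral, and temporal dimensions can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the tensor layout `(C, T, F, D)`, the single-head attention, the order in which the axes are visited, the residual connections, and the names `axis_self_attention` and `cst_block_sketch` are all assumptions made for illustration, and MSULE, normalization, and feed-forward layers are omitted.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def axis_self_attention(x, axis, w_q, w_k, w_v):
    """Single-head self-attention along one axis of x.

    x has the assumed shape (C, T, F, D): channels, time frames,
    frequency bins, embedding size. All axes other than `axis` and
    the embedding axis are treated as batch dimensions.
    """
    # Bring the attended axis next to the embedding axis: (..., L, D).
    x_moved = np.moveaxis(x, axis, -2)
    q, k, v = x_moved @ w_q, x_moved @ w_k, x_moved @ w_v
    # Scaled dot-product attention over the L positions.
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(q.shape[-1])
    out = softmax(scores, axis=-1) @ v
    # Restore the original axis order.
    return np.moveaxis(out, -2, axis)

def cst_block_sketch(x, params):
    """Apply channel, spectral, then temporal attention with residuals.

    `params` maps each (hypothetical) axis name to a (w_q, w_k, w_v) tuple.
    """
    for name, axis in (("channel", 0), ("spectral", 2), ("temporal", 1)):
        x = x + axis_self_attention(x, axis, *params[name])
    return x

# Example with random features and random projection weights.
rng = np.random.default_rng(0)
C, T, F, D = 4, 10, 16, 8
features = rng.standard_normal((C, T, F, D))
params = {
    name: tuple(0.1 * rng.standard_normal((D, D)) for _ in range(3))
    for name in ("channel", "spectral", "temporal")
}
output = cst_block_sketch(features, params)  # same shape as `features`
```

Because each attention step treats the other two dimensions as batch axes, channel attention at one time-frequency bin never mixes information from a different bin. Information crosses those dimensions only through the subsequent spectral and temporal steps.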

Yusun Shul, Dayun Choi, Jung-Woo Choi

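Radio equipment and telecommunications equipment; Communications; Wireless communications; Applications of electronic technology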

Yusun Shul, Dayun Choi, Jung-Woo Choi. CST-former: Multidimensional Attention-based Transformer for Sound Event Localization and Detection in Real Scenes[EB/OL]. (2025-04-17)[2025-04-28]. https://arxiv.org/abs/2504.12870.