Binaural Sound Event Localization and Detection Neural Network based on HRTF Localization Cues for Humanoid Robots
Humanoid robots require simultaneous sound event type and direction estimation for situational awareness, but conventional two-channel input struggles with elevation estimation and front-back confusion. This paper proposes a binaural sound event localization and detection (BiSELD) neural network to address these challenges. BiSELDnet learns time-frequency patterns and head-related transfer function (HRTF) localization cues from binaural input features. A novel eight-channel binaural time-frequency feature (BTFF) is introduced, comprising left/right mel-spectrograms, V-maps, an interaural time difference (ITD) map (below 1.5 kHz), an interaural level difference (ILD) map (above 5 kHz with front-back asymmetry), and spectral cue (SC) maps (above 5 kHz for elevation). The effectiveness of BTFF was confirmed across omnidirectional, horizontal, and median planes. BiSELDnets, particularly one based on the efficient Trinity module, were implemented to output time series of direction vectors for each sound event class, enabling simultaneous detection and localization. Vector activation map (VAM) visualization was proposed to analyze network learning, confirming BiSELDnet's focus on the N1 notch frequency for elevation estimation. Comparative evaluations under urban background noise conditions demonstrated that the proposed BiSELD model significantly outperforms state-of-the-art (SOTA) SELD models with binaural input.
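The abstract specifies the frequency bands for the interaural cue maps: the ITD map is limited to frequencies below 1.5 kHz and the ILD and spectral cue (SC) maps to frequencies above 5 kHz. A minimal NumPy sketch of how such cue maps might be assembled from a binaural signal is shown below; the band limits follow the abstract, while the STFT settings, the per-bin phase-delay estimate of ITD, and the use of spectral deviation from the time-averaged spectrum as a stand-in SC map are all assumptions, not the paper's actual feature extraction.

```python
import numpy as np

def stft(x, n_fft=512, hop=256):
    # Hann-windowed STFT, returned as (freq, time)
    win = np.hanning(n_fft)
    frames = [x[i:i + n_fft] * win for i in range(0, len(x) - n_fft, hop)]
    return np.fft.rfft(np.stack(frames), axis=1).T

def btff_cue_maps(left, right, sr=16000, n_fft=512, hop=256):
    """Sketch of ITD/ILD/SC cue maps as described in the abstract.
    Band limits (1.5 kHz, 5 kHz) come from the abstract; the rest
    (window, eps, ITD via interaural phase delay) is assumed."""
    L, R = stft(left, n_fft, hop), stft(right, n_fft, hop)
    freqs = np.fft.rfftfreq(n_fft, 1 / sr)
    eps = 1e-8
    # ILD map: interaural level difference in dB, kept above 5 kHz only
    ild = 20 * np.log10((np.abs(L) + eps) / (np.abs(R) + eps))
    ild[freqs < 5000, :] = 0.0
    # ITD map: per-bin interaural phase delay, kept below 1.5 kHz only
    ipd = np.angle(L * np.conj(R))
    itd = np.zeros_like(ipd)
    lo = (freqs > 0) & (freqs < 1500)
    itd[lo, :] = ipd[lo, :] / (2 * np.pi * freqs[lo, None])
    # SC map (placeholder): high-frequency deviation from the
    # time-averaged left spectrum, standing in for HRTF notch cues
    sc = np.zeros_like(ild)
    hi = freqs >= 5000
    mag = 20 * np.log10(np.abs(L) + eps)
    sc[hi, :] = mag[hi, :] - mag[hi, :].mean(axis=1, keepdims=True)
    return itd, ild, sc
```

In the actual BTFF these cue maps are stacked with the left/right mel-spectrograms and V-maps to form the eight-channel input feature.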
Gyeong-Tae Lee
Radio equipment and telecommunication equipment; communications; wireless communication; applied electronics
Gyeong-Tae Lee. Binaural Sound Event Localization and Detection Neural Network based on HRTF Localization Cues for Humanoid Robots [EB/OL]. (2025-08-06) [2025-08-16]. https://arxiv.org/abs/2508.04333.