Explainable speech emotion recognition through attentive pooling: insights from attention-based temporal localization
State-of-the-art transformer models for Speech Emotion Recognition (SER) rely on temporal feature aggregation, yet advanced pooling methods remain underexplored. We systematically benchmark pooling strategies, including Multi-Query Multi-Head Attentive Statistics Pooling, which achieves a 3.5 percentage point macro F1 gain over average pooling. Attention analysis shows that 15 percent of frames capture 80 percent of emotion cues, revealing a localized pattern of emotional information. Analysis of high-attention frames reveals that non-linguistic vocalizations and hyperarticulated phonemes are disproportionately prioritized during pooling, mirroring human perceptual strategies. Our findings position attentive pooling as both a performant SER mechanism and a biologically plausible tool for explainable emotion localization. In the Interspeech 2025 Speech Emotion Recognition in Naturalistic Conditions Challenge, our approach obtained a macro F1 score of 0.3649.
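The pooling layer benchmarked in the abstract can be summarized compactly. Below is a minimal PyTorch sketch of multi-query multi-head attentive statistics pooling, assuming frame-level features from a transformer encoder; the class name MQMHAStatsPooling, the two-layer scoring network, and all layer sizes are illustrative assumptions, not the authors' exact implementation.

import torch
import torch.nn as nn

class MQMHAStatsPooling(nn.Module):
    """Multi-query multi-head attentive statistics pooling (illustrative sketch)."""

    def __init__(self, dim: int, num_heads: int = 4, num_queries: int = 2):
        super().__init__()
        self.num_slots = num_heads * num_queries
        # One attention score per frame for each (query, head) slot,
        # computed jointly by a small scoring network (an assumed design).
        self.score = nn.Sequential(
            nn.Linear(dim, dim),
            nn.Tanh(),
            nn.Linear(dim, self.num_slots),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, dim) frame-level features.
        # Attention weights over frames for each slot: (batch, frames, slots).
        w = torch.softmax(self.score(x), dim=1)
        # Attention-weighted mean per slot: (batch, slots, dim).
        mean = torch.einsum("btq,btd->bqd", w, x)
        # Attention-weighted standard deviation per slot.
        sq = torch.einsum("btq,btd->bqd", w, x * x)
        std = (sq - mean * mean).clamp(min=1e-6).sqrt()
        # Concatenate means and stds into one utterance-level embedding.
        return torch.cat([mean, std], dim=-1).flatten(1)

# Usage: pool 100 frames of 256-dim features into one utterance embedding.
pool = MQMHAStatsPooling(dim=256, num_heads=4, num_queries=2)
out = pool(torch.randn(8, 100, 256))  # shape (8, 4096): 8 slots * 2 stats * 256

Averaging weighted means and standard deviations per slot is what makes the pooling "statistics" pooling; the learned softmax weights are also what enable the attention analysis described in the abstract, since frames receiving high weight can be inspected directly.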
Tahitoa Leygue, Astrid Sabourin, Christian Bolzmacher, Sylvain Bouchigny, Margarita Anastassova, Quoc-Cuong Pham
Computing Technology, Computer Technology
Tahitoa Leygue, Astrid Sabourin, Christian Bolzmacher, Sylvain Bouchigny, Margarita Anastassova, Quoc-Cuong Pham. Explainable speech emotion recognition through attentive pooling: insights from attention-based temporal localization [EB/OL]. (2025-06-18) [2025-07-16]. https://arxiv.org/abs/2506.15754