MEDUSA: A Multimodal Deep Fusion Multi-Stage Training Framework for Speech Emotion Recognition in Naturalistic Conditions
MEDUSA: A Multimodal Deep Fusion Multi-Stage Training Framework for Speech Emotion Recognition in Naturalistic Conditions
SER is a challenging task due to the subjective nature of human emotions and their uneven representation under naturalistic conditions. We propose MEDUSA, a multimodal framework with a four-stage training pipeline, which effectively handles class imbalance and emotion ambiguity. The first two stages train an ensemble of classifiers that utilize DeepSER, a novel extension of a deep cross-modal transformer fusion mechanism from pretrained self-supervised acoustic and linguistic representations. Manifold MixUp is employed for further regularization. The last two stages optimize a trainable meta-classifier that combines the ensemble predictions. Our training approach incorporates human annotation scores as soft targets, coupled with balanced data sampling and multitask learning. MEDUSA ranked 1st in Task 1: Categorical Emotion Recognition in the Interspeech 2025: Speech Emotion Recognition in Naturalistic Conditions Challenge.
Georgios Chatzichristodoulou、Despoina Kosmopoulou、Antonios Kritikos、Anastasia Poulopoulou、Efthymios Georgiou、Athanasios Katsamanis、Vassilis Katsouros、Alexandros Potamianos
计算技术、计算机技术
Georgios Chatzichristodoulou,Despoina Kosmopoulou,Antonios Kritikos,Anastasia Poulopoulou,Efthymios Georgiou,Athanasios Katsamanis,Vassilis Katsouros,Alexandros Potamianos.MEDUSA: A Multimodal Deep Fusion Multi-Stage Training Framework for Speech Emotion Recognition in Naturalistic Conditions[EB/OL].(2025-06-11)[2025-06-24].https://arxiv.org/abs/2506.09556.点此复制
评论