Cross-attention and Self-attention for Audio-visual Speaker Diarization in MISP-Meeting Challenge

Abstract

This paper presents the system developed for Task 1 of the Multi-modal Information-based Speech Processing (MISP) 2025 Challenge. We introduce CASA-Net, an embedding fusion method designed for end-to-end audio-visual speaker diarization (AVSD) systems. CASA-Net incorporates a cross-attention (CA) module to effectively capture cross-modal interactions in audio-visual signals and employs a self-attention (SA) module to learn contextual relationships among audio-visual frames. To further enhance performance, we adopt a training strategy that integrates pseudo-label refinement and retraining, improving the accuracy of timestamp predictions. Additionally, median filtering and overlap averaging are applied as post-processing techniques to eliminate outliers and smooth prediction labels. Our system achieved a diarization error rate (DER) of 8.18% on the evaluation set, representing a relative improvement of 47.3% over the baseline DER of 15.52%.
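The two post-processing steps named in the abstract, overlap averaging and median filtering, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names `overlap_average` and `median_filter`, the chunk layout (chunk `i` starts at frame `i * hop`), the kernel size, the edge padding, and the 0.5 decision threshold in the usage note are all assumptions made for illustration.

```python
import numpy as np


def overlap_average(chunk_probs, hop, total_frames):
    """Average per-frame speaker probabilities across overlapping chunks.

    chunk_probs: list of arrays shaped (chunk_len, n_speakers).
    Assumption: chunk i covers frames [i * hop, i * hop + chunk_len).
    """
    n_speakers = chunk_probs[0].shape[1]
    acc = np.zeros((total_frames, n_speakers))
    count = np.zeros((total_frames, 1))
    for i, probs in enumerate(chunk_probs):
        start = i * hop
        end = min(start + probs.shape[0], total_frames)
        acc[start:end] += probs[: end - start]
        count[start:end] += 1
    # Frames covered by no chunk keep probability 0 instead of dividing by 0.
    return acc / np.maximum(count, 1)


def median_filter(labels, kernel=5):
    """Median-filter binary labels along time, per speaker.

    labels: array shaped (n_frames, n_speakers); kernel must be odd.
    Edges are padded by repeating the first and last frame (an assumption).
    """
    pad = kernel // 2
    padded = np.pad(labels, ((pad, pad), (0, 0)), mode="edge")
    # Stack the `kernel` shifted views so the median runs over each window.
    windows = np.stack([padded[k : k + labels.shape[0]] for k in range(kernel)])
    return np.median(windows, axis=0).astype(labels.dtype)
```

A possible usage, again under the assumptions above, is to average the chunk-level outputs first, threshold them with `(probs > 0.5).astype(int)`, and then pass the binary labels through `median_filter` to remove isolated single-frame flips.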

Authors: Zhaoyang Li, Haodong Zhou, Longjie Luo, Xiaoxiao Li, Yongxin Chen, Lin Li, Qingyang Hong

Subject classification: Computing technology, computer technology

Recommended citation: Zhaoyang Li, Haodong Zhou, Longjie Luo, Xiaoxiao Li, Yongxin Chen, Lin Li, Qingyang Hong. Cross-attention and Self-attention for Audio-visual Speaker Diarization in MISP-Meeting Challenge [EB/OL]. (2025-06-03) [2025-07-25]. https://arxiv.org/abs/2506.02621.
