Plug-and-Play Co-Occurring Face Attention for Robust Audio-Visual Speaker Extraction

Source: arXiv
English Abstract

Audio-visual speaker extraction isolates a target speaker's speech from a speech mixture, conditioned on a visual cue, typically a recording of the target speaker's face. In real-world scenarios, however, other co-occurring faces are often present on-screen, providing valuable speaker activity cues. In this work, we introduce a plug-and-play inter-speaker attention module that processes a flexible number of co-occurring faces, enabling more accurate speaker extraction in complex multi-person environments. We integrate our module into two prominent models: AV-DPRNN and the state-of-the-art AV-TFGridNet. Extensive experiments on diverse datasets, including the highly overlapped VoxCeleb2 and the sparsely overlapped MISP, demonstrate that our approach consistently outperforms the baselines. Furthermore, cross-dataset evaluations on LRS2 and LRS3 confirm the robustness and generalizability of our method.
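
The abstract does not detail the module's internals. The sketch below illustrates one plausible reading, assuming the inter-speaker attention is a cross-attention from the target speaker's visual feature sequence (query) to the concatenated features of the co-occurring face tracks (keys and values); the class name, dimensions, and residual design are hypothetical illustrations, not the paper's actual implementation.

# Minimal sketch of a plug-and-play inter-speaker attention module.
# All names, dimensions, and the exact attention formulation are
# assumptions for illustration; the paper's design may differ.
import torch
import torch.nn as nn

class InterSpeakerAttention(nn.Module):
    """Cross-attention from the target speaker's visual features (query)
    to a variable number of co-occurring face tracks (keys/values)."""

    def __init__(self, dim: int = 256, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, target_feat: torch.Tensor, other_feats: torch.Tensor) -> torch.Tensor:
        # target_feat: (batch, time, dim) features of the target face track.
        # other_feats: (batch, num_others * time, dim) features of all
        #              co-occurring face tracks concatenated along time;
        #              num_others may vary, since the key/value sequence
        #              length is flexible.
        attended, _ = self.attn(query=target_feat, key=other_feats, value=other_feats)
        # A residual connection keeps the module plug-and-play: absent
        # useful co-occurring cues, the target features pass through.
        return self.norm(target_feat + attended)

if __name__ == "__main__":
    module = InterSpeakerAttention(dim=256, num_heads=4)
    target = torch.randn(2, 50, 256)      # target speaker's face features
    others = torch.randn(2, 3 * 50, 256)  # 3 co-occurring faces per sample
    print(module(target, others).shape)   # torch.Size([2, 50, 256])

Because the key/value length is independent of the query length, the same module handles any number of on-screen faces, which is consistent with the "flexible numbers of co-occurring faces" described above.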

Zexu Pan, Shengkui Zhao, Tingting Wang, Kun Zhou, Yukun Ma, Chong Zhang, Bin Ma

Computing technology, computer technology

Zexu Pan, Shengkui Zhao, Tingting Wang, Kun Zhou, Yukun Ma, Chong Zhang, Bin Ma. Plug-and-Play Co-Occurring Face Attention for Robust Audio-Visual Speaker Extraction [EB/OL]. (2025-05-26) [2025-06-16]. https://arxiv.org/abs/2505.20635.
