Plug-and-Play Co-Occurring Face Attention for Robust Audio-Visual Speaker Extraction

Source: arXiv
English Abstract

Audio-visual speaker extraction isolates a target speaker's speech from a speech mixture, conditioned on a visual cue, typically a recording of the target speaker's face. In real-world scenarios, however, other co-occurring faces are often present on-screen, providing valuable speaker activity cues. In this work, we introduce a plug-and-play inter-speaker attention module that processes a flexible number of co-occurring faces, enabling more accurate speaker extraction in complex multi-person environments. We integrate our module into two prominent models: AV-DPRNN and the state-of-the-art AV-TFGridNet. Extensive experiments on diverse datasets, including the highly overlapped VoxCeleb2 and the sparsely overlapped MISP, demonstrate that our approach consistently outperforms the baselines. Furthermore, cross-dataset evaluations on LRS2 and LRS3 confirm the robustness and generalizability of our method.
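
The abstract does not detail the module's internals. The sketch below illustrates one plausible reading, assuming the inter-speaker attention is a cross-attention from the target speaker's visual feature sequence (query) to the concatenated features of the co-occurring face tracks (keys and values); the class name, dimensions, and residual design are hypothetical illustrations, not the paper's actual implementation.

# Minimal sketch of a plug-and-play inter-speaker attention module.
# All names, dimensions, and the exact attention formulation are
# assumptions for illustration; the paper's design may differ.
import torch
import torch.nn as nn

class InterSpeakerAttention(nn.Module):
    """Cross-attention from the target speaker's visual features (query)
    to a variable number of co-occurring face tracks (keys/values)."""

    def __init__(self, dim: int = 256, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, target_feat: torch.Tensor, other_feats: torch.Tensor) -> torch.Tensor:
        # target_feat: (batch, time, dim) features of the target face track.
        # other_feats: (batch, num_others * time, dim) features of all
        #              co-occurring face tracks concatenated along time;
        #              num_others may vary, since the key/value sequence
        #              length is flexible.
        attended, _ = self.attn(query=target_feat, key=other_feats, value=other_feats)
        # A residual connection keeps the module plug-and-play: absent
        # useful co-occurring cues, the target features pass through.
        return self.norm(target_feat + attended)

if __name__ == "__main__":
    module = InterSpeakerAttention(dim=256, num_heads=4)
    target = torch.randn(2, 50, 256)      # target speaker's face features
    others = torch.randn(2, 3 * 50, 256)  # 3 co-occurring faces per sample
    print(module(target, others).shape)   # torch.Size([2, 50, 256])

Because the key/value length is independent of the query length, the same module handles any number of on-screen faces, which is consistent with the "flexible numbers of co-occurring faces" described above.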

Zexu Pan, Shengkui Zhao, Tingting Wang, Kun Zhou, Yukun Ma, Chong Zhang, Bin Ma

Computing technology, computer technology

Zexu Pan, Shengkui Zhao, Tingting Wang, Kun Zhou, Yukun Ma, Chong Zhang, Bin Ma. Plug-and-Play Co-Occurring Face Attention for Robust Audio-Visual Speaker Extraction [EB/OL]. (2025-05-26) [2025-06-16]. https://arxiv.org/abs/2505.20635.
