Long-Context Modeling Networks for Monaural Speech Enhancement: A Comparative Study

Source: arXiv
Abstract

Advanced long-context modeling backbone networks, such as Transformer, Conformer, and Mamba, have demonstrated state-of-the-art performance in speech enhancement. However, a systematic and comprehensive comparative study of these backbones within a unified speech enhancement framework remains lacking. In addition, xLSTM, a more recent and efficient variant of LSTM, has shown promising results in language modeling and as a general-purpose vision backbone. In this paper, we investigate the capability of xLSTM in speech enhancement, and conduct a comprehensive comparison and analysis of the Transformer, Conformer, Mamba, and xLSTM backbones within a unified framework, considering both causal and noncausal configurations. Overall, xLSTM and Mamba achieve better performance than Transformer and Conformer. Mamba demonstrates significantly superior training and inference efficiency, particularly for long speech inputs, whereas xLSTM suffers from the slowest processing speed.
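The abstract describes evaluating interchangeable long-context backbones inside a single unified enhancement framework. Purely as an illustration of that idea (not the authors' code), the following minimal PyTorch sketch shows a mask-based enhancer whose backbone module can be swapped between a Transformer, Conformer, Mamba, or xLSTM block stack; the class name UnifiedEnhancer and the parameters n_freq and d_model are assumptions for this example, and the Transformer backbone uses only standard PyTorch layers.

```python
import torch
import torch.nn as nn

# Hypothetical sketch: a unified mask-based speech enhancer with a swappable
# long-context backbone. Names and dimensions are illustrative assumptions.
class UnifiedEnhancer(nn.Module):
    def __init__(self, backbone: nn.Module, n_freq: int = 257, d_model: int = 256):
        super().__init__()
        self.encoder = nn.Linear(n_freq, d_model)   # project noisy spectrogram frames
        self.backbone = backbone                    # Transformer / Conformer / Mamba / xLSTM stack
        self.decoder = nn.Linear(d_model, n_freq)   # predict a magnitude mask

    def forward(self, noisy_mag: torch.Tensor) -> torch.Tensor:
        # noisy_mag: (batch, frames, n_freq) noisy magnitude spectrogram
        x = self.encoder(noisy_mag)
        x = self.backbone(x)
        mask = torch.sigmoid(self.decoder(x))
        return mask * noisy_mag                     # enhanced magnitude

# Example backbone: a (noncausal) Transformer built from standard PyTorch layers.
transformer_backbone = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=256, nhead=4, batch_first=True),
    num_layers=4,
)
model = UnifiedEnhancer(transformer_backbone)
enhanced = model(torch.randn(2, 100, 257))  # (batch=2, frames=100, bins=257)
print(enhanced.shape)                        # torch.Size([2, 100, 257])
```

A Mamba or xLSTM variant would replace only the backbone module (using the corresponding sequence-model implementation), leaving the encoder and mask decoder unchanged, which is what makes such a framework suitable for a controlled backbone comparison.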

Qiquan Zhang, Moran Chen, Zeyang Song, Hexin Liu, Xiangyu Zhang, Haizhou Li

Subject: Computing Technology; Computer Technology

Qiquan Zhang, Moran Chen, Zeyang Song, Hexin Liu, Xiangyu Zhang, Haizhou Li. Long-Context Modeling Networks for Monaural Speech Enhancement: A Comparative Study [EB/OL]. (2025-07-06) [2025-07-21]. https://arxiv.org/abs/2507.04368.
