Long-Context Modeling Networks for Monaural Speech Enhancement: A Comparative Study

Source: arXiv
Abstract

Advanced long-context modeling backbone networks, such as Transformer, Conformer, and Mamba, have demonstrated state-of-the-art performance in speech enhancement. However, a systematic and comprehensive comparative study of these backbones within a unified speech enhancement framework remains lacking. In addition, xLSTM, a more recent and efficient variant of LSTM, has shown promising results in language modeling and as a general-purpose vision backbone. In this paper, we investigate the capability of xLSTM in speech enhancement, and conduct a comprehensive comparison and analysis of the Transformer, Conformer, Mamba, and xLSTM backbones within a unified framework, considering both causal and noncausal configurations. Overall, xLSTM and Mamba achieve better performance than Transformer and Conformer. Mamba demonstrates significantly superior training and inference efficiency, particularly for long speech inputs, whereas xLSTM suffers from the slowest processing speed.
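The abstract describes evaluating interchangeable long-context backbones inside a single unified enhancement framework. Purely as an illustration of that idea (not the authors' code), the following minimal PyTorch sketch shows a mask-based enhancer whose backbone module can be swapped between a Transformer, Conformer, Mamba, or xLSTM block stack; the class name UnifiedEnhancer and the parameters n_freq and d_model are assumptions for this example, and the Transformer backbone uses only standard PyTorch layers.

```python
import torch
import torch.nn as nn

# Hypothetical sketch: a unified mask-based speech enhancer with a swappable
# long-context backbone. Names and dimensions are illustrative assumptions.
class UnifiedEnhancer(nn.Module):
    def __init__(self, backbone: nn.Module, n_freq: int = 257, d_model: int = 256):
        super().__init__()
        self.encoder = nn.Linear(n_freq, d_model)   # project noisy spectrogram frames
        self.backbone = backbone                    # Transformer / Conformer / Mamba / xLSTM stack
        self.decoder = nn.Linear(d_model, n_freq)   # predict a magnitude mask

    def forward(self, noisy_mag: torch.Tensor) -> torch.Tensor:
        # noisy_mag: (batch, frames, n_freq) noisy magnitude spectrogram
        x = self.encoder(noisy_mag)
        x = self.backbone(x)
        mask = torch.sigmoid(self.decoder(x))
        return mask * noisy_mag                     # enhanced magnitude

# Example backbone: a (noncausal) Transformer built from standard PyTorch layers.
transformer_backbone = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=256, nhead=4, batch_first=True),
    num_layers=4,
)
model = UnifiedEnhancer(transformer_backbone)
enhanced = model(torch.randn(2, 100, 257))  # (batch=2, frames=100, bins=257)
print(enhanced.shape)                        # torch.Size([2, 100, 257])
```

A Mamba or xLSTM variant would replace only the backbone module (using the corresponding sequence-model implementation), leaving the encoder and mask decoder unchanged, which is what makes such a framework suitable for a controlled backbone comparison.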

Qiquan Zhang, Moran Chen, Zeyang Song, Hexin Liu, Xiangyu Zhang, Haizhou Li

Subject: Computing Technology; Computer Technology

Qiquan Zhang, Moran Chen, Zeyang Song, Hexin Liu, Xiangyu Zhang, Haizhou Li. Long-Context Modeling Networks for Monaural Speech Enhancement: A Comparative Study [EB/OL]. (2025-07-06) [2025-07-21]. https://arxiv.org/abs/2507.04368.
