|国家预印本平台
首页|A High-Fidelity Speech Super Resolution Network using a Complex Global Attention Module with Spectro-Temporal Loss

A High-Fidelity Speech Super Resolution Network using a Complex Global Attention Module with Spectro-Temporal Loss

A High-Fidelity Speech Super Resolution Network using a Complex Global Attention Module with Spectro-Temporal Loss

来源:Arxiv_logoArxiv
英文摘要

Speech super-resolution (SSR) enhances low-resolution speech by increasing the sampling rate. While most SSR methods focus on magnitude reconstruction, recent research highlights the importance of phase reconstruction for improved perceptual quality. Therefore, we introduce CTFT-Net, a Complex Time-Frequency Transformation Network that reconstructs both magnitude and phase in complex domains for improved SSR tasks. It incorporates a complex global attention block to model inter-phoneme and inter-frequency dependencies and a complex conformer to capture long-range and local features, improving frequency reconstruction and noise robustness. CTFT-Net employs time-domain and multi-resolution frequency-domain loss functions for better generalization. Experiments show CTFT-Net outperforms state-of-the-art models (NU-Wave, WSRGlow, NVSR, AERO) on the VCTK dataset, particularly for extreme upsampling (2 kHz to 48 kHz), reconstructing high frequencies effectively without noisy artifacts.

Tarikul Islam Tamiti、Biraj Joshi、Rida Hasan、Rashedul Hasan、Taieba Athay、Nursad Mamun、Anomadarshi Barua

通信无线通信

Tarikul Islam Tamiti,Biraj Joshi,Rida Hasan,Rashedul Hasan,Taieba Athay,Nursad Mamun,Anomadarshi Barua.A High-Fidelity Speech Super Resolution Network using a Complex Global Attention Module with Spectro-Temporal Loss[EB/OL].(2025-06-30)[2025-07-16].https://arxiv.org/abs/2507.00229.点此复制

评论