Audio-Visual Speech Enhancement: Architectural Design and Deployment Strategies

Source: arXiv
Abstract

This paper introduces a new AI-based Audio-Visual Speech Enhancement (AVSE) system and presents a comparative performance analysis of different deployment architectures. The proposed AVSE system employs convolutional neural networks (CNNs) for spectral feature extraction and long short-term memory (LSTM) networks for temporal modeling, enabling robust speech enhancement through multimodal fusion of audio and visual cues. Multiple deployment scenarios are investigated, including cloud-based, edge-assisted, and standalone device implementations. Their performance is evaluated in terms of speech quality improvement, latency, and computational overhead. Real-world experiments are conducted across various network conditions, including Ethernet, Wi-Fi, 4G, and 5G, to analyze the trade-offs between processing delay, communication latency, and perceptual speech quality. The results show that while cloud deployment achieves the highest enhancement quality, edge-assisted architectures offer the best balance between latency and intelligibility, meeting real-time requirements under 5G and Wi-Fi 6 conditions. These findings provide practical guidelines for selecting and optimizing AVSE deployment architectures in diverse applications, including assistive hearing devices, telepresence, and industrial communications.
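The trade-off described in the abstract, where end-to-end delay is the sum of processing delay and communication latency, can be illustrated with a minimal Python sketch. Everything in it is an assumption for illustration: the `Deployment` class, the `feasible` helper, the 40 ms budget, and all latency figures are hypothetical placeholders, not measurements or definitions from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Deployment:
    """One candidate deployment (hypothetical structure, not from the paper)."""
    name: str
    processing_ms: float   # inference time on the node that runs the model
    network_rtt_ms: float  # round-trip communication latency; 0 for on-device

    def end_to_end_ms(self) -> float:
        # Total delay seen by the listener: compute plus communication.
        return self.processing_ms + self.network_rtt_ms


def feasible(deployments, budget_ms):
    """Return deployments within the latency budget, fastest first."""
    within = [d for d in deployments if d.end_to_end_ms() <= budget_ms]
    return sorted(within, key=lambda d: d.end_to_end_ms())


# Placeholder numbers chosen only to exercise the code; NOT results from the paper.
CANDIDATES = [
    Deployment("cloud", processing_ms=10.0, network_rtt_ms=80.0),
    Deployment("edge-assisted", processing_ms=20.0, network_rtt_ms=10.0),
    Deployment("standalone", processing_ms=120.0, network_rtt_ms=0.0),
]

if __name__ == "__main__":
    for d in feasible(CANDIDATES, budget_ms=40.0):
        print(f"{d.name}: {d.end_to_end_ms():.1f} ms")
```

A comparison of this kind only ranks candidates by delay. Perceptual speech quality, which the paper also evaluates, would need to be weighed separately.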

Anis Hamadouche, Haifeng Luo, Mathini Sellathurai, Tharm Ratnarajah

Subjects: Communications; Wireless Communications

Anis Hamadouche, Haifeng Luo, Mathini Sellathurai, Tharm Ratnarajah. Audio-Visual Speech Enhancement: Architectural Design and Deployment Strategies [EB/OL]. (2025-08-11) [2025-08-24]. https://arxiv.org/abs/2508.08468
