Lightweight End-to-end Text-to-speech Synthesis for low resource on-device applications

Abstract

Recent works have shown that modelling raw waveform directly from text in an end-to-end (E2E) fashion produces more natural-sounding speech than traditional neural text-to-speech (TTS) systems based on a cascade or two-stage approach. However, current E2E state-of-the-art models are computationally complex and memory-consuming, making them unsuitable for real-time offline on-device applications in low-resource scenarios. To address this issue, we propose a Lightweight E2E-TTS (LE2E) model that generates high-quality speech requiring minimal computational resources. We evaluate the proposed model on the LJSpeech dataset and show that it achieves state-of-the-art performance while being up to $90\%$ smaller in terms of model parameters and $10\times$ faster in real-time-factor. Furthermore, we demonstrate that the proposed E2E training paradigm achieves better quality compared to an equivalent architecture trained in a two-stage approach. Our results suggest that LE2E is a promising approach for developing real-time, high quality, low-resource TTS applications for on-device applications.

Authors: Biel Tura Vecino, Adam Gabryś, Daniel Mątwicki, Andrzej Pomirski, Tom Iddon, Marius Cotescu, Jaime Lorenzo-Trueba

DOI: 10.21437/SSW.2023-35

Subject classification: Computing and computer technology; automation technology and equipment

Recommended citation: Biel Tura Vecino, Adam Gabryś, Daniel Mątwicki, Andrzej Pomirski, Tom Iddon, Marius Cotescu, Jaime Lorenzo-Trueba. Lightweight End-to-end Text-to-speech Synthesis for low resource on-device applications [EB/OL]. (2025-05-12) [2025-06-07]. https://arxiv.org/abs/2505.07701.
