Accelerating Flow-Matching-Based Text-to-Speech via Empirically Pruned Step Sampling
Flow-matching-based text-to-speech (TTS) models, such as Voicebox, E2 TTS, and F5-TTS, have attracted significant attention in recent years. These models require multiple sampling steps to reconstruct speech from noise, making inference speed a key challenge. Reducing the number of sampling steps can greatly improve inference efficiency. To this end, we introduce Fast F5-TTS, a training-free approach to accelerate the inference of flow-matching-based TTS models. By inspecting the sampling trajectory of F5-TTS, we identify redundant steps and propose Empirically Pruned Step Sampling (EPSS), a non-uniform time-step sampling strategy that effectively reduces the number of sampling steps. Our approach achieves a 7-step generation with an inference RTF of 0.030 on an NVIDIA RTX 3090 GPU, making it 4 times faster than the original F5-TTS while maintaining comparable performance. Furthermore, EPSS performs well on E2 TTS models, demonstrating its strong generalization ability.
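The core idea behind EPSS is that a flow-matching sampler need not spend its step budget uniformly over the ODE trajectory. The sketch below illustrates non-uniform time-step sampling in a generic Euler flow-matching sampler. The specific 7-step schedule used by Fast F5-TTS is not given in the abstract, so the `pruned` schedule here is hypothetical, and the constant vector field is a toy stand-in for the learned model:

```python
import numpy as np

def euler_flow_sample(v_field, x0, timesteps):
    """Integrate dx/dt = v(x, t) with explicit Euler over the given timesteps.

    A non-uniform `timesteps` array implements a pruned schedule: steps can be
    concentrated where the trajectory curves most, while redundant near-linear
    segments are covered by a single large step.
    """
    x = x0
    for t0, t1 in zip(timesteps[:-1], timesteps[1:]):
        x = x + (t1 - t0) * v_field(x, t0)
    return x

# Toy constant vector field, for which explicit Euler is exact:
# x(1) = x(0) + c regardless of the step schedule.
c = np.array([1.0, -2.0])
v = lambda x, t: c

x0 = np.zeros(2)
uniform = np.linspace(0.0, 1.0, 33)  # 32 uniform steps, as in a standard sampler
pruned = np.array([0.0, 0.05, 0.15, 0.3, 0.5, 0.7, 0.9, 1.0])  # 7 steps (hypothetical schedule)

x_uniform = euler_flow_sample(v, x0, uniform)
x_pruned = euler_flow_sample(v, x0, pruned)
print(np.allclose(x_uniform, x_pruned))  # True: both reach x0 + c
```

For a real learned vector field the two results would differ slightly; the paper's claim is that a well-chosen non-uniform schedule keeps that gap negligible while cutting the step count, and hence the RTF, by roughly 4x.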
Qixi Zheng, Yushen Chen, Zhikang Niu, Ziyang Ma, Xiaofei Wang, Kai Yu, Xie Chen
Computing Technology, Computer Technology
Qixi Zheng, Yushen Chen, Zhikang Niu, Ziyang Ma, Xiaofei Wang, Kai Yu, Xie Chen. Accelerating Flow-Matching-Based Text-to-Speech via Empirically Pruned Step Sampling [EB/OL]. (2025-05-26) [2025-06-24]. https://arxiv.org/abs/2505.19931