FELLE: Autoregressive Speech Synthesis with Token-Wise Coarse-to-Fine Flow Matching

Source: arXiv
Abstract

To advance continuous-valued token modeling and temporal-coherence enforcement, we propose FELLE, an autoregressive model that integrates language modeling with token-wise flow matching. By leveraging the autoregressive nature of language models and the generative efficacy of flow matching, FELLE effectively predicts continuous-valued tokens (mel-spectrograms). For each continuous-valued token, FELLE modifies the general prior distribution in flow matching by incorporating information from the previous step, improving coherence and stability. Furthermore, to enhance synthesis quality, FELLE introduces a coarse-to-fine flow-matching mechanism, generating continuous-valued tokens hierarchically, conditioned on the language model's output. Experimental results demonstrate the potential of incorporating flow-matching techniques in autoregressive mel-spectrogram modeling, leading to significant improvements in TTS generation quality, as shown in https://aka.ms/felle.
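The abstract's idea of drawing each token's flow-matching prior from the previous step can be illustrated with a toy sketch. Everything below is an assumption made for illustration and is not taken from the paper: the Gaussian prior centred on the previous frame, the straight-line probability path, the names `sample_prior`, `interpolate`, `target_velocity`, and `euler_sample`, and the three-bin "mel frames". The oracle velocity stands in for a learned network conditioned on the language model's output, and the coarse-to-fine hierarchy is not modelled.

```python
import random

def sample_prior(prev_token, sigma, rng):
    """Draw x0 from a Gaussian centred on the previous token (assumed prior form)."""
    return [p + sigma * rng.gauss(0.0, 1.0) for p in prev_token]

def interpolate(x0, x1, t):
    """Point at time t on the straight-line path from x0 (prior sample) to x1 (target)."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]

def target_velocity(x0, x1):
    """Velocity of the straight-line path; it is constant in t."""
    return [b - a for a, b in zip(x0, x1)]

def euler_sample(velocity_fn, x0, steps):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with fixed Euler steps."""
    x = list(x0)
    dt = 1.0 / steps
    for i in range(steps):
        t = i * dt
        v = velocity_fn(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

rng = random.Random(0)
prev_frame = [0.2, -0.1, 0.4]       # hypothetical previous mel frame (3 bins)
target_frame = [0.25, -0.05, 0.5]   # hypothetical current mel frame

# Prior sample that already sits near the previous frame.
x0 = sample_prior(prev_frame, sigma=0.1, rng=rng)

# Regression target a velocity network would be trained on at any (x_t, t).
v_star = target_velocity(x0, target_frame)

# Using the oracle velocity in place of a trained network,
# Euler integration carries x0 to the target frame.
x1_hat = euler_sample(lambda x, t: v_star, x0, steps=8)
```

In a trained system the lambda would be replaced by a model that predicts the velocity from the noisy frame, the time step, and the language model's hidden state; starting from a prior near the previous frame shortens the path the flow has to cover between adjacent frames.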

Shujie Liu, Yanqing Liu, Jiaming Zhou, Yong Qin, Hui Wang, Shiwan Zhao, Haiyang Sun, Yan Lu, Haoqin Sun, Lingwei Meng, Yifan Yang, Jinyu Li

Computing technology, computer technology

Shujie Liu, Yanqing Liu, Jiaming Zhou, Yong Qin, Hui Wang, Shiwan Zhao, Haiyang Sun, Yan Lu, Haoqin Sun, Lingwei Meng, Yifan Yang, Jinyu Li. FELLE: Autoregressive Speech Synthesis with Token-Wise Coarse-to-Fine Flow Matching [EB/OL]. (2025-02-16) [2025-08-02]. https://arxiv.org/abs/2502.11128.
