VALL-T: Decoder-Only Generative Transducer for Robust and Decoding-Controllable Text-to-Speech

Source: arXiv
Abstract

Recent TTS models with a decoder-only Transformer architecture, such as SPEAR-TTS and VALL-E, achieve impressive naturalness and demonstrate the ability for zero-shot adaptation given a speech prompt. However, such decoder-only TTS models lack monotonic alignment constraints, sometimes leading to hallucination issues such as mispronunciation, word skipping, and repeating. To address this limitation, we propose VALL-T, a generative Transducer model that introduces shifting relative position embeddings for the input phoneme sequence, explicitly indicating the monotonic generation process while maintaining the architecture of a decoder-only Transformer. Consequently, VALL-T retains the capability of prompt-based zero-shot adaptation and demonstrates better robustness against hallucinations, with a relative reduction of 28.3% in the word error rate.
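
The abstract does not spell out how the shifting relative positions are computed. As a rough illustration only, the sketch below assumes that position 0 marks the phoneme currently being synthesized, that earlier phonemes get negative positions and later ones positive, and that a "shift" simply advances the current index by one. The function names `relative_positions` and `shift` are hypothetical and are not taken from the paper or its code.

```python
def relative_positions(num_phonemes: int, current: int) -> list[int]:
    """Position of each phoneme relative to the one assumed to be in focus.

    Assumption for illustration: the phoneme at index `current` gets 0,
    phonemes before it get negative values, phonemes after it positive.
    """
    if not 0 <= current < num_phonemes:
        raise ValueError("current index out of range")
    return [i - current for i in range(num_phonemes)]


def shift(current: int, num_phonemes: int) -> int:
    """Advance the focus by one phoneme; it never moves backwards."""
    return min(current + 1, num_phonemes - 1)


# Hypothetical four-phoneme input.
phonemes = ["HH", "AH", "L", "OW"]
current = 0
print(relative_positions(len(phonemes), current))  # [0, 1, 2, 3]

current = shift(current, len(phonemes))
print(relative_positions(len(phonemes), current))  # [-1, 0, 1, 2]
```

Because the focus index only ever increases, the position pattern seen by the model can only slide in one direction, which is one simple way to picture the monotonic constraint the abstract describes.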

Zhikang Niu, Hankun Wang, Yifan Yang, Kai Yu, Yiwei Guo, Hui Zhang, Shuai Wang, Chenpeng Du, Xie Chen

DOI: 10.1109/ICASSP49660.2025.10890943

Communications

Zhikang Niu, Hankun Wang, Yifan Yang, Kai Yu, Yiwei Guo, Hui Zhang, Shuai Wang, Chenpeng Du, Xie Chen. VALL-T: Decoder-Only Generative Transducer for Robust and Decoding-Controllable Text-to-Speech[EB/OL]. (2024-01-25)[2025-04-29]. https://arxiv.org/abs/2401.14321.
