On the Cost and Benefits of Training Context with Utterance or Full Conversation Training: A Comparative Stud
Modern conversational TTS systems produce high-quality utterances, but many are not publicly available. Are existing open-source architectures inadequate, or are current training techniques insufficient? This paper investigates prominent models and how they handle conversational context. Using 20 GPU-hours on an NVIDIA H100, we empirically compare two approaches: context-based utterance-level training and full conversation training. The results show that context-based utterance-level training achieves a higher MOS (4.3/5.0 vs. 3.7/5.0) and reduces training time by 37%, while full conversation training suffers from speaker similarity hallucination issues. These findings provide practical guidelines for conversational TTS development, favoring utterance-level training with contextual conditioning for both resource efficiency and output quality.
Hyouin Liu, Zhikuan Zhang
Computing technology, computer technology
Hyouin Liu, Zhikuan Zhang. On the Cost and Benefits of Training Context with Utterance or Full Conversation Training: A Comparative Stud[EB/OL].(2025-05-11)[2025-06-19]. https://arxiv.org/abs/2505.07202.