On the Cost and Benefits of Training Context with Utterance or Full Conversation Training: A Comparative Study

Source: arXiv
Abstract

Modern TTS systems designed for conversations produce high-quality utterances but often remain publicly inaccessible. Are existing open-source architectures inadequate, or are current training techniques insufficient? This paper investigates prominent models and their underlying behaviors regarding conversational context. Using 20 GPU-hours on an NVIDIA H100, we empirically examine two approaches: context-based utterance-level training versus full conversation training. Results demonstrate that context-based utterance training achieves a higher MOS (4.3/5.0 vs. 3.7/5.0) and reduces training time by 37%, while full conversation approaches suffer from speaker similarity hallucination issues. These findings provide practical guidelines for conversational TTS development, favoring utterance-level training with contextual conditioning for both resource efficiency and output quality.

Hyouin Liu, Zhikuan Zhang

Subject: Computing technology, computer technology

Hyouin Liu, Zhikuan Zhang. On the Cost and Benefits of Training Context with Utterance or Full Conversation Training: A Comparative Study [EB/OL]. (2025-05-11) [2025-06-19]. https://arxiv.org/abs/2505.07202.