InstructTTSEval: Benchmarking Complex Natural-Language Instruction Following in Text-to-Speech Systems

Abstract

In modern speech synthesis, paralinguistic information, such as a speaker's vocal timbre, emotional state, and dynamic prosody, plays a critical role in conveying nuance beyond mere semantics. Traditional Text-to-Speech (TTS) systems rely on fixed style labels or an inserted speech prompt to control these cues, which severely limits flexibility. Recent work instead uses natural-language instructions to modulate paralinguistic features, substantially improving the generalization of instruction-driven TTS models. Although many TTS systems now support customized synthesis via textual description, their actual ability to interpret and execute complex instructions remains largely unexplored. In addition, there is still a shortage of high-quality benchmarks and automated evaluation metrics specifically designed for instruction-based TTS, which hinders accurate assessment and iterative optimization of these models. To address these limitations, we introduce InstructTTSEval, a benchmark for measuring the capability of complex natural-language style control. It comprises three tasks, namely Acoustic-Parameter Specification, Descriptive-Style Directive, and Role-Play, each with English and Chinese subsets of 1k test cases apiece (6k in total), paired with reference audio. We leverage Gemini as an automatic judge to assess the instruction-following abilities of TTS systems. Our evaluation of accessible instruction-following TTS systems highlights substantial room for further improvement. We anticipate that InstructTTSEval will drive progress toward more powerful, flexible, and accurate instruction-following TTS.

Authors: Kexin Huang, Qian Tu, Liwei Fan, Chenchen Yang, Dong Zhang, Shimin Li, Zhaoye Fei, Qinyuan Cheng, Xipeng Qiu

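Subject classification: computing and computer technology; automation technology and equipment; commonly used foreign languages; Chinese language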

Recommended citation: Kexin Huang, Qian Tu, Liwei Fan, Chenchen Yang, Dong Zhang, Shimin Li, Zhaoye Fei, Qinyuan Cheng, Xipeng Qiu. InstructTTSEval: Benchmarking Complex Natural-Language Instruction Following in Text-to-Speech Systems [EB/OL]. (2025-06-19) [2025-07-03]. https://arxiv.org/abs/2506.16381.
