Mini-Omni-Reasoner: Token-Level Thinking-in-Speaking in Large Speech Models

Source: arXiv
Abstract

Reasoning is essential for effective communication and decision-making. While recent advances in large language models (LLMs) and multimodal LLMs (MLLMs) have shown that incorporating explicit reasoning significantly improves understanding and generalization, reasoning in large speech models (LSMs) remains in a nascent stage. Early efforts attempt to transfer the "Thinking-before-Speaking" paradigm from textual models to speech. However, this sequential formulation introduces notable latency, as spoken responses are delayed until reasoning is fully completed, impairing real-time interaction and communication efficiency. To address this, we propose Mini-Omni-Reasoner, a framework that enables reasoning within speech via a novel "Thinking-in-Speaking" formulation. Rather than completing reasoning before producing any verbal output, Mini-Omni-Reasoner interleaves silent reasoning tokens with spoken response tokens at the token level. This design allows continuous speech generation while embedding structured internal reasoning, leveraging the model's high-frequency token processing capability. Although the two token streams are interleaved, local semantic alignment is enforced so that each response token is informed by the reasoning that precedes it. To support this framework, we introduce Spoken-Math-Problems-3M, a large-scale dataset tailored for interleaved reasoning and response. The dataset ensures that verbal tokens consistently follow relevant reasoning content, enabling accurate and efficient learning of speech-coupled reasoning. Built on a hierarchical Thinker-Talker architecture, Mini-Omni-Reasoner delivers fluent yet logically grounded spoken responses, maintaining both naturalness and precision. On the Spoken-MQA benchmark, it achieves a +19.1% gain in arithmetic reasoning and +6.4% in contextual understanding, with shorter outputs and zero decoding latency.
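
To make the interleaving concrete, here is a minimal Python sketch of what token-level "Thinking-in-Speaking" decoding could look like. The 4:1 reasoning-to-response ratio, the mode flags, and the toy model below are illustrative assumptions, since the abstract does not specify the actual interleaving schedule or interface.

```python
# Minimal sketch of token-level "Thinking-in-Speaking" decoding.
# The interleave ratio, mode flags, and toy model are hypothetical
# illustrations, not the paper's actual API.

from typing import Callable, List


def decode_interleaved(
    next_token: Callable[[str], str],
    num_response_tokens: int,
    reason_per_response: int = 4,  # assumed ratio of silent to spoken tokens
) -> List[str]:
    """Interleave silent reasoning tokens with spoken response tokens.

    Before each spoken token, a short burst of reasoning tokens is
    generated, so every response token is locally conditioned on the
    reasoning that precedes it, while speech output starts immediately
    instead of waiting for a completed chain of thought.
    """
    spoken: List[str] = []
    for _ in range(num_response_tokens):
        for _ in range(reason_per_response):
            next_token("reason")      # silent: kept internal, never voiced
        tok = next_token("respond")   # spoken: streamed to the talker/TTS
        spoken.append(tok)
    return spoken


if __name__ == "__main__":
    # Scripted stand-in for the speech model: reasoning fragments and the
    # spoken answer tokens for a toy arithmetic question, 12 * 7.
    reasoning = iter(["<12*7>", "<=10*7", "+2*7>", "<=70+14>"] * 3)
    answer = iter(["twelve", "times", "seven", "is", "eighty-four"])

    def toy_model(mode: str) -> str:
        return next(reasoning, "<pad>") if mode == "reason" else next(answer)

    print(" ".join(decode_interleaved(toy_model, num_response_tokens=5)))
```

Under this reading, "zero decoding latency" follows from the schedule itself: the first spoken token is emitted after only a few reasoning tokens rather than after the entire reasoning trace, so speech synthesis can begin at the very start of decoding.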

Zhifei Xie, Ziyang Ma, Zihang Liu, Kaiyu Pang, Hongyu Li, Jialin Zhang, Yue Liao, Deheng Ye, Chunyan Miao, Shuicheng Yan

Subjects: Computing Technology; Computer Technology

Zhifei Xie, Ziyang Ma, Zihang Liu, Kaiyu Pang, Hongyu Li, Jialin Zhang, Yue Liao, Deheng Ye, Chunyan Miao, Shuicheng Yan. Mini-Omni-Reasoner: Token-Level Thinking-in-Speaking in Large Speech Models [EB/OL]. (2025-08-18) [2025-09-06]. https://arxiv.org/abs/2508.15827.