SALMONN-omni: A Codec-free LLM for Full-duplex Speech Understanding and Generation
Full-duplex multimodal large language models (LLMs) provide a unified framework for addressing diverse speech understanding and generation tasks, enabling more natural and seamless human-machine conversations. Unlike traditional modularised conversational AI systems, which separate speech recognition, understanding, and text-to-speech generation into distinct components, multimodal LLMs operate as single end-to-end models. This streamlined design eliminates error propagation across components and fully leverages the rich non-verbal information embedded in input speech signals. We introduce SALMONN-omni, a codec-free, full-duplex speech understanding and generation model capable of simultaneously listening to its own generated speech and background sounds while speaking. To support this capability, we propose a novel duplex spoken dialogue framework incorporating a "thinking" mechanism that facilitates asynchronous text and speech generation relying on embeddings instead of codecs (quantized speech and audio tokens). Experimental results demonstrate SALMONN-omni's versatility across a broad range of streaming speech tasks, including speech recognition, speech enhancement, and spoken question answering. Additionally, SALMONN-omni excels at managing turn-taking, barge-in, and echo cancellation scenarios, establishing its potential as a robust prototype for full-duplex conversational AI systems. To the best of our knowledge, SALMONN-omni is the first codec-free model of its kind. A full technical report along with model checkpoints will be released soon.
Yuxuan Wang, Xianzhao Chen, Chao Zhang, Lu Lu, Jun Zhang, Siyin Wang, Guangzhi Sun, Xiaohai Tian, Wenyi Yu, Xiaoyu Yang
Subjects: Communication and Computing Technology; Computer Technology; Wireless Communication
Yuxuan Wang, Xianzhao Chen, Chao Zhang, Lu Lu, Jun Zhang, Siyin Wang, Guangzhi Sun, Xiaohai Tian, Wenyi Yu, Xiaoyu Yang. SALMONN-omni: A Codec-free LLM for Full-duplex Speech Understanding and Generation [EB/OL]. (2024-11-27) [2025-06-30]. https://arxiv.org/abs/2411.18138.