ZipVoice-Dialog: Non-Autoregressive Spoken Dialogue Generation with Flow Matching
ZipVoice-Dialog: Non-Autoregressive Spoken Dialogue Generation with Flow Matching
Generating spoken dialogue is more challenging than monologue text-to-speech (TTS) due to the need for realistic turn-taking and distinct speaker timbres. Existing spoken dialogue generation models, being auto-regressive, suffer from slow and unstable inference. To overcome these limitations, we introduce ZipVoice-Dialog, a non-autoregressive zero-shot spoken dialogue generation model built upon flow matching. Key designs include: 1) speaker-turn embeddings for precise speaker turn-taking; 2) a curriculum learning strategy for stable speech-text alignment; 3) specialized strategies to enable stereo dialogue generation. Additionally, recognizing the lack of open-source large-scale spoken dialogue datasets, we curated OpenDialog, a 6.8k-hour spoken dialogue dataset from in-the-wild speech data. Furthermore, we established a benchmark to comprehensively evaluate various models. Experimental results demonstrate that ZipVoice-Dialog achieves superior performance in intelligibility, speaker turn-taking accuracy, speaker similarity, and inference speed. Our codes, model checkpoints, demo samples, and the OpenDialog dataset are all publicly available at https://github.com/k2-fsa/ZipVoice.
Han Zhu、Wei Kang、Liyong Guo、Zengwei Yao、Fangjun Kuang、Weiji Zhuang、Zhaoqing Li、Zhifeng Han、Dong Zhang、Xin Zhang、Xingchen Song、Long Lin、Daniel Povey
通信
Han Zhu,Wei Kang,Liyong Guo,Zengwei Yao,Fangjun Kuang,Weiji Zhuang,Zhaoqing Li,Zhifeng Han,Dong Zhang,Xin Zhang,Xingchen Song,Long Lin,Daniel Povey.ZipVoice-Dialog: Non-Autoregressive Spoken Dialogue Generation with Flow Matching[EB/OL].(2025-07-12)[2025-07-25].https://arxiv.org/abs/2507.09318.点此复制
评论