A Semantic Enhancement Method for Speech Tokenizers Based on Two-Stage Information Supplementation
Liu Zexuan 1, Liu Gang 1
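Author information
- 1. School of Artificial Intelligence, Beijing University of Posts and Telecommunications, Beijing 100876, China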
Abstract
The speech tokenizer is a core module of speech synthesis: how well its feature discretization preserves semantics directly determines the naturalness, intelligibility, and emotional consistency of the synthesized speech. Conventional speech tokenizers lose semantic information during semantic capture and quantization, and they struggle to balance semantic and acoustic features. To address these problems, this paper proposes a semantic enhancement method based on two-stage information supplementation. With Residual Vector Quantization (RVQ) as the central hub, the method builds a closed-loop cooperation between a pre-quantization stage and a post-quantization stage. In the pre-quantization stage, the pre-trained Self-Supervised Learning (SSL) model WavLM extracts high-quality semantic features, which are adapted by a linear projection and then deeply fused with acoustic features to produce a unified bimodal feature carrying both semantic and acoustic information. In the post-quantization stage, a semantic-aware loss function, built on a high-level semantic mapping and a cosine similarity measure, forces the quantization process to retain core semantic information and thereby guides the training of the RVQ module. To verify the method, multiple comparative and ablation experiments are conducted. The results show that the proposed method (abbreviated as SemanticTokenizer) significantly outperforms mainstream baseline models such as EnCodec and SoundStream on speech quantization and reconstruction, large-model speech synthesis, cross-speaker semantic stability, and semantic-text alignment. In the large-model speech synthesis task, the Word Error Rate (WER) falls to 3.66%, which is 38.2% lower than that of EnCodec; in the cross-speaker semantic stability experiment, the average edit distance is only 12.6, which is 73.4% lower than that of EnCodec. These results show that the two-stage semantic enhancement method effectively addresses the two core challenges of conventional tokenizers, namely difficulty in capturing semantics and loss of information during quantization. The method offers a new paradigm for semantic enhancement of speech synthesis tokenizers and supports the deployment of speech synthesis in professional and complex scenarios.
Keywords
Artificial Intelligence/Speech Synthesis/Residual Vector Quantization/Self-Supervised Learning/Loss Guidance
Cite this article
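Liu Zexuan, Liu Gang. A Semantic Enhancement Method for Speech Tokenizers Based on Two-Stage Information Supplementation [EB/OL]. (2026-03-19) [2026-03-20]. http://www.paper.edu.cn/releasepaper/content/202603-186.
Subject classification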
Computing technology, computer technology
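Illustrative sketch

The two stages summarized in the abstract can be illustrated with a minimal, dependency-free Python sketch. It is not the authors' implementation. The abstract does not specify the fusion operator or the exact loss form, so frame-wise concatenation and a loss of the form 1 minus mean cosine similarity are assumptions made here, and the function names `project`, `fuse`, `cosine_similarity`, and `semantic_loss` are hypothetical.

```python
import math


def project(frames, weight):
    """Linear projection of each frame; weight has shape (out_dim, in_dim)."""
    return [[sum(w * x for w, x in zip(row, frame)) for row in weight]
            for frame in frames]


def fuse(semantic_frames, acoustic_frames):
    """Assumed fusion: concatenate semantic and acoustic vectors frame by frame."""
    if len(semantic_frames) != len(acoustic_frames):
        raise ValueError("semantic and acoustic sequences must have equal length")
    return [s + a for s, a in zip(semantic_frames, acoustic_frames)]


def cosine_similarity(u, v, eps=1e-8):
    """Cosine similarity of two equal-length vectors, guarded against zero norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / max(norm_u * norm_v, eps)


def semantic_loss(mapped_quantized, semantic_targets):
    """Assumed loss form: 1 minus the mean frame-wise cosine similarity.

    mapped_quantized stands for quantized features after a high-level
    semantic mapping; semantic_targets stands for the SSL semantic features.
    """
    sims = [cosine_similarity(q, t)
            for q, t in zip(mapped_quantized, semantic_targets)]
    return 1.0 - sum(sims) / len(sims)


if __name__ == "__main__":
    # Toy data: one frame, 2-dim semantic feature projected to 3 dims.
    semantic = project([[1.0, 2.0]], [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
    unified = fuse(semantic, [[0.5, 0.5]])
    print(unified)
    print(semantic_loss([[1.0, 0.0]], [[0.0, 1.0]]))
```

Under these assumptions the loss is 0 when the mapped quantized features point in the same direction as the semantic targets and grows as they diverge, which is one way a cosine-based term can steer quantization toward preserving semantics.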