SPD-BERT: A Pre-trained Spoken Dialog Language Model Fusing Role, Structure, and Semantics
Spoken language understanding (SLU) is a key component of task-oriented dialog systems, and pre-trained language models have achieved important breakthroughs in SLU. However, most existing pre-trained language models are trained on large-scale written-text corpora, while spoken language differs markedly from written language in structure, conditions of use, and expression patterns. This paper constructs a large-scale, two-role, multi-turn spoken dialog corpus and proposes four self-supervised pre-training tasks that fuse role, structure, and semantics: whole-word masking, role prediction, intra-query reverse prediction, and inter-query exchange prediction. A BERT-based spoken dialog language model, SPD-BERT (SPoken Dialog BERT), is pre-trained through multi-task joint learning. The model is evaluated on three manually annotated datasets from an intelligent customer-service scenario in the finance domain, covering intent recognition, entity recognition, and pinyin error correction; the experimental results demonstrate its effectiveness.
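The abstract only names the four self-supervised objectives. As a rough illustration of how training instances for the three dialog-level objectives could be constructed from a two-role, multi-turn dialog, here is a minimal Python sketch; all function names, the "customer"/"agent" role labels, and the instance formats are illustrative assumptions, not the authors' implementation. Whole-word masking is the standard BERT-wwm objective and is omitted.

```python
import random

# A dialog is a list of (role, utterance) pairs. The role labels and
# all names below are assumptions for illustration only.

def role_example(dialog):
    """Role prediction: classify which role produced a sampled utterance."""
    role, utt = random.choice(dialog)
    return {"task": "role", "text": utt, "label": role}

def intra_reverse_example(dialog, p=0.5):
    """Intra-query reverse prediction: with probability p, reverse the
    character order inside one utterance; the model predicts whether
    the utterance was reversed."""
    _, utt = random.choice(dialog)
    is_reversed = random.random() < p
    return {"task": "intra_reverse",
            "text": utt[::-1] if is_reversed else utt,
            "label": int(is_reversed)}

def inter_exchange_example(dialog, p=0.5):
    """Inter-query exchange prediction: with probability p, swap two
    adjacent turns; the model predicts whether the pair was swapped."""
    i = random.randrange(len(dialog) - 1)
    (_, u1), (_, u2) = dialog[i], dialog[i + 1]
    is_swapped = random.random() < p
    a, b = (u2, u1) if is_swapped else (u1, u2)
    return {"task": "inter_exchange", "text_a": a, "text_b": b,
            "label": int(is_swapped)}

if __name__ == "__main__":
    dialog = [("customer", "我想查一下我的账户余额"),
              ("agent", "好的，请提供您的卡号后四位"),
              ("customer", "6228"),
              ("agent", "您的当前余额是一千零二十元")]
    print(role_example(dialog))
    print(intra_reverse_example(dialog))
    print(inter_exchange_example(dialog))
```

In the actual model, such instances would presumably be tokenized and trained jointly with whole-word masking under a multi-task loss, as the abstract describes.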
Li Feng (李锋), Huang Jian (黄健)
Linguistics
dialog system; spoken language understanding; pre-trained language model; intent recognition; entity recognition
Li Feng, Huang Jian. SPD-BERT: A Pre-trained Spoken Dialog Language Model Fusing Role, Structure, and Semantics [EB/OL]. (2022-04-07) [2025-08-06]. https://chinaxiv.org/abs/202204.00048.