RA-CLAP: Relation-Augmented Emotional Speaking Style Contrastive Language-Audio Pretraining For Speech Retrieval
The Contrastive Language-Audio Pretraining (CLAP) model has demonstrated excellent performance on general audio-description tasks, such as audio retrieval. However, in the emerging field of emotional speaking style description (ESSD), cross-modal contrastive pretraining remains largely unexplored. In this paper, we propose a novel speech retrieval task called emotional speaking style retrieval (ESSR), and ESS-CLAP, an emotional speaking style CLAP model tailored for learning the relationship between speech and natural language descriptions. In addition, we propose relation-augmented CLAP (RA-CLAP) to address the limitation of traditional methods that assume a strict binary matching relationship between caption and audio. The model leverages self-distillation to learn the potential local matching relationships between speech and descriptions, thereby enhancing generalization ability. The experimental results validate the effectiveness of RA-CLAP, providing a valuable reference for ESSD.
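The core idea described in the abstract, replacing the strict one-hot pairing of standard contrastive pretraining with self-distilled soft targets, can be illustrated with a minimal sketch. The snippet below is an assumption-laden illustration, not the authors' released implementation: the function name, temperature values, and mixing weight alpha are hypothetical choices, and the paper's actual loss and architecture may differ.

```python
# Minimal sketch of a relation-augmented contrastive loss in the spirit of
# RA-CLAP. All hyperparameters and the blending scheme are illustrative
# assumptions, not taken from the paper.
import torch
import torch.nn.functional as F

def relation_augmented_loss(audio_emb, text_emb, tau=0.07, tau_teacher=0.1, alpha=0.5):
    """Contrastive loss whose targets blend the one-hot audio-caption
    pairing with self-distilled soft relations across the batch."""
    # L2-normalize so dot products are cosine similarities.
    a = F.normalize(audio_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)

    logits = a @ t.T / tau  # audio-to-text similarity matrix

    # Hard targets: the i-th audio matches the i-th caption (standard CLAP).
    hard = torch.eye(a.size(0), device=a.device)

    # Soft targets from the model's own similarities (self-distillation):
    # an off-diagonal caption may partially describe an audio clip, so its
    # probability mass is kept instead of forcing a strict binary relation.
    with torch.no_grad():
        soft = F.softmax(a @ t.T / tau_teacher, dim=-1)

    targets = alpha * hard + (1 - alpha) * soft  # relation-augmented targets

    # Symmetric cross-entropy against the soft targets in both directions.
    loss_a2t = torch.sum(-targets * F.log_softmax(logits, dim=-1), dim=-1).mean()
    loss_t2a = torch.sum(-targets.T * F.log_softmax(logits.T, dim=-1), dim=-1).mean()
    return 0.5 * (loss_a2t + loss_t2a)
```

With alpha = 1 this reduces to the standard binary CLAP objective; lowering alpha lets partially matching captions contribute graded supervision, which is the relaxation the abstract attributes to self-distillation.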
Haoqin Sun, Jingguang Tian, Jiaming Zhou, Hui Wang, Jiabei He, Shiwan Zhao, Xiangyu Kong, Desheng Hu, Xinkang Xu, Xinhui Hu, Yong Qin
Computing Technology, Computer Technology
Haoqin Sun, Jingguang Tian, Jiaming Zhou, Hui Wang, Jiabei He, Shiwan Zhao, Xiangyu Kong, Desheng Hu, Xinkang Xu, Xinhui Hu, Yong Qin. RA-CLAP: Relation-Augmented Emotional Speaking Style Contrastive Language-Audio Pretraining For Speech Retrieval [EB/OL]. (2025-05-25) [2025-07-25]. https://arxiv.org/abs/2505.19437.