RA-CLAP: Relation-Augmented Emotional Speaking Style Contrastive Language-Audio Pretraining For Speech Retrieval
The Contrastive Language-Audio Pretraining (CLAP) model has demonstrated excellent performance on general audio-description tasks, such as audio retrieval. However, in the emerging field of emotional speaking style description (ESSD), cross-modal contrastive pretraining remains largely unexplored. In this paper, we propose a novel speech retrieval task called emotional speaking style retrieval (ESSR), and ESS-CLAP, an emotional speaking style CLAP model tailored for learning the relationship between speech and natural language descriptions. In addition, we propose relation-augmented CLAP (RA-CLAP) to address the limitation of traditional methods that assume a strict binary matching relationship between caption and audio. The model leverages self-distillation to learn the potential local matching relationships between speech and descriptions, thereby enhancing generalization ability. The experimental results validate the effectiveness of RA-CLAP, providing a valuable reference for ESSD.
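The core idea described in the abstract, replacing the strict one-hot pairing of standard contrastive pretraining with self-distilled soft targets, can be illustrated with a minimal sketch. The snippet below is an assumption-laden illustration, not the authors' released implementation: the function name, temperature values, and mixing weight alpha are hypothetical choices, and the paper's actual loss and architecture may differ.

```python
# Minimal sketch of a relation-augmented contrastive loss in the spirit of
# RA-CLAP. All hyperparameters and the blending scheme are illustrative
# assumptions, not taken from the paper.
import torch
import torch.nn.functional as F

def relation_augmented_loss(audio_emb, text_emb, tau=0.07, tau_teacher=0.1, alpha=0.5):
    """Contrastive loss whose targets blend the one-hot audio-caption
    pairing with self-distilled soft relations across the batch."""
    # L2-normalize so dot products are cosine similarities.
    a = F.normalize(audio_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)

    logits = a @ t.T / tau  # audio-to-text similarity matrix

    # Hard targets: the i-th audio matches the i-th caption (standard CLAP).
    hard = torch.eye(a.size(0), device=a.device)

    # Soft targets from the model's own similarities (self-distillation):
    # an off-diagonal caption may partially describe an audio clip, so its
    # probability mass is kept instead of forcing a strict binary relation.
    with torch.no_grad():
        soft = F.softmax(a @ t.T / tau_teacher, dim=-1)

    targets = alpha * hard + (1 - alpha) * soft  # relation-augmented targets

    # Symmetric cross-entropy against the soft targets in both directions.
    loss_a2t = torch.sum(-targets * F.log_softmax(logits, dim=-1), dim=-1).mean()
    loss_t2a = torch.sum(-targets.T * F.log_softmax(logits.T, dim=-1), dim=-1).mean()
    return 0.5 * (loss_a2t + loss_t2a)
```

With alpha = 1 this reduces to the standard binary CLAP objective; lowering alpha lets partially matching captions contribute graded supervision, which is the relaxation the abstract attributes to self-distillation.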
Haoqin Sun, Jingguang Tian, Jiaming Zhou, Hui Wang, Jiabei He, Shiwan Zhao, Xiangyu Kong, Desheng Hu, Xinkang Xu, Xinhui Hu, Yong Qin
Computing Technology, Computer Technology
Haoqin Sun, Jingguang Tian, Jiaming Zhou, Hui Wang, Jiabei He, Shiwan Zhao, Xiangyu Kong, Desheng Hu, Xinkang Xu, Xinhui Hu, Yong Qin. RA-CLAP: Relation-Augmented Emotional Speaking Style Contrastive Language-Audio Pretraining For Speech Retrieval [EB/OL]. (2025-05-25) [2025-07-25]. https://arxiv.org/abs/2505.19437.