
Human-CLAP: Human-perception-based contrastive language-audio pretraining


Source: arXiv
Abstract

Contrastive language-audio pretraining (CLAP) is widely used for audio generation and recognition tasks. For example, CLAPScore, which utilizes the similarity of CLAP embeddings, has been a major metric for evaluating the relevance between audio and text in text-to-audio generation. However, the relationship between CLAPScore and human subjective evaluation scores remains unclear. We show that CLAPScore has a low correlation with human subjective evaluation scores. Additionally, we propose a human-perception-based CLAP, called Human-CLAP, obtained by training a contrastive language-audio model using subjective evaluation scores. In our experiments, Human-CLAP improved the Spearman's rank correlation coefficient (SRCC) between the CLAPScore and the subjective evaluation scores by more than 0.25 compared with the conventional CLAP.
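The abstract describes CLAPScore as the similarity between CLAP audio and text embeddings. A minimal sketch of that idea, assuming cosine similarity and using random vectors in place of real CLAP model outputs (the function name and embedding dimensionality are illustrative, not from the paper):

```python
import numpy as np

def clap_score(audio_emb: np.ndarray, text_emb: np.ndarray) -> float:
    """Cosine similarity between an audio embedding and a text embedding.

    Higher values indicate stronger audio-text relevance under the model.
    """
    a = audio_emb / np.linalg.norm(audio_emb)
    t = text_emb / np.linalg.norm(text_emb)
    return float(np.dot(a, t))

# Toy example: random vectors stand in for embeddings produced by a
# CLAP audio encoder and text encoder (512 dims chosen arbitrarily).
rng = np.random.default_rng(0)
audio_emb = rng.standard_normal(512)
text_emb = rng.standard_normal(512)
score = clap_score(audio_emb, text_emb)  # a value in [-1, 1]
```

In practice the embeddings would come from a pretrained CLAP model's audio and text encoders; the paper's contribution is fine-tuning such a model so that this score correlates better with human subjective ratings.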

Taisei Takano, Yuki Okamoto, Yusuke Kanamori, Yuki Saito, Ryotaro Nagase, Hiroshi Saruwatari

Subject: Communications and Electronics Technology Applications

Taisei Takano, Yuki Okamoto, Yusuke Kanamori, Yuki Saito, Ryotaro Nagase, Hiroshi Saruwatari. Human-CLAP: Human-perception-based contrastive language-audio pretraining [EB/OL]. (2025-07-12) [2025-07-25]. https://arxiv.org/abs/2506.23553.
