Expressive Speech Retrieval using Natural Language Descriptions of Speaking Style

Abstract

We introduce the task of expressive speech retrieval, where the goal is to retrieve speech utterances spoken in a given style based on a natural language description of that style. While prior work has primarily focused on performing speech retrieval based on what was said in an utterance, we aim to do so based on how something was said. We train speech and text encoders to embed speech and text descriptions of speaking styles into a joint latent space, which enables using free-form text prompts describing emotions or styles as queries to retrieve matching expressive speech segments. We perform detailed analyses of various aspects of our proposed framework, including encoder architectures, training criteria for effective cross-modal alignment, and prompt augmentation for improved generalization to arbitrary text queries. Experiments on multiple datasets encompassing 22 speaking styles demonstrate that our approach achieves strong retrieval performance as measured by Recall@k.
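The retrieval step and the Recall@k metric mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the embeddings below are hand-made 2-D toy vectors standing in for encoder outputs, the function names `cosine`, `retrieve`, and `recall_at_k` are hypothetical, cosine similarity is an assumed ranking score, and Recall@k is assumed to count a query as a hit when at least one of the top-k retrieved utterances has the queried style. The paper's exact definition may differ.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def retrieve(query_emb, speech_embs, k):
    """Return indices of the k speech embeddings most similar to the query."""
    ranked = sorted(
        range(len(speech_embs)),
        key=lambda i: cosine(query_emb, speech_embs[i]),
        reverse=True,
    )
    return ranked[:k]

def recall_at_k(query_embs, query_styles, speech_embs, speech_styles, k):
    """Fraction of queries with at least one matching-style utterance in the top k.

    Assumption: this hit-based definition is for illustration only.
    """
    hits = 0
    for q, style in zip(query_embs, query_styles):
        top = retrieve(q, speech_embs, k)
        if any(speech_styles[i] == style for i in top):
            hits += 1
    return hits / len(query_embs)

# Toy embeddings (hypothetical): in a real system these would come from
# trained speech and text encoders sharing one latent space.
speech_embs = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.9]]
speech_styles = ["happy", "happy", "sad", "sad"]
query_embs = [[1.0, 0.0], [0.0, 1.0]]   # embedded text prompts
query_styles = ["happy", "sad"]

r1 = recall_at_k(query_embs, query_styles, speech_embs, speech_styles, k=1)
print(f"Recall@1 = {r1:.2f}")
```

Because both modalities are assumed to live in one latent space, a text query is ranked against speech embeddings directly, with no transcript or intermediate label involved.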

Authors: Wonjune Kang, Deb Roy

Subject classification: Computing technology, computer technology

Recommended citation: Wonjune Kang, Deb Roy. Expressive Speech Retrieval using Natural Language Descriptions of Speaking Style [EB/OL]. (2025-08-15) [2025-08-28]. https://arxiv.org/abs/2508.11187.
