CLAP-ART: Automated Audio Captioning with Semantic-rich Audio Representation Tokenizer

Abstract

Automated Audio Captioning (AAC) aims to describe the semantic contexts of general sounds, including acoustic events and scenes, by leveraging effective acoustic features. To enhance performance, an AAC method, EnCLAP, employed discrete tokens from EnCodec as an effective input for fine-tuning the language model BART. However, EnCodec is designed to reconstruct waveforms rather than to capture the semantic contexts of general sounds, which AAC should describe. To address this issue, we propose CLAP-ART, an AAC method that utilizes "semantic-rich and discrete" tokens as input. CLAP-ART computes semantic-rich discrete tokens from pre-trained audio representations (AR) through vector quantization. We experimentally confirmed that CLAP-ART outperforms the baseline EnCLAP on two AAC benchmarks, indicating that semantic-rich discrete tokens derived from semantically rich AR are beneficial for AAC.
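
The vector quantization step mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes plain nearest-neighbor quantization against a single fixed codebook, and the function name `quantize`, the toy codebook, and the toy frame features are all hypothetical. The abstract does not specify the quantizer design, the codebook size, or how the codebook is trained.

```python
import numpy as np


def quantize(features: np.ndarray, codebook: np.ndarray) -> np.ndarray:
    """Map each feature frame to the index of its nearest codebook vector.

    features: array of shape (T, D), one D-dimensional embedding per frame.
    codebook: array of shape (K, D), K code vectors (assumed already learned).
    Returns an integer array of shape (T,), one discrete token per frame.
    """
    # Pairwise differences via broadcasting: (T, 1, D) - (1, K, D) -> (T, K, D)
    diffs = features[:, None, :] - codebook[None, :, :]
    # Squared Euclidean distance from every frame to every code: (T, K)
    dists = np.sum(diffs ** 2, axis=-1)
    # The token for a frame is the index of its closest code vector
    return np.argmin(dists, axis=1)


# Hypothetical toy data: 3 code vectors and 4 frames in a 2-D feature space
codebook = np.array([[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]])
features = np.array([[0.1, -0.1], [0.9, 1.2], [-0.8, 0.7], [0.6, 0.6]])

tokens = quantize(features, codebook)
print(tokens.tolist())  # [0, 1, 2, 1]
```

In this sketch the resulting token sequence stands in for the discrete input that would be fed to a language model; in an actual system the features would come from a pre-trained audio representation model rather than from hand-written arrays.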

Authors: Daiki Takeuchi, Binh Thien Nguyen, Masahiro Yasuda, Yasunori Ohishi, Daisuke Niizumi, Noboru Harada

Subject classification: computing technology, computer technology; communications

Recommended citation: Daiki Takeuchi, Binh Thien Nguyen, Masahiro Yasuda, Yasunori Ohishi, Daisuke Niizumi, Noboru Harada. CLAP-ART: Automated Audio Captioning with Semantic-rich Audio Representation Tokenizer [EB/OL]. (2025-05-31) [2025-07-09]. https://arxiv.org/abs/2506.00800.
