Bringing Interpretability to Neural Audio Codecs
Neural audio codecs have grown in popularity because of their potential for efficiently modeling audio with transformers. These codecs convert a continuous waveform into discrete units at a much lower sampling rate. In contrast to semantic units, acoustic units may lack interpretability because their training objectives primarily focus on reconstruction performance. This paper proposes a two-step approach to explore how speech information is encoded within the codec tokens. The primary goal of the analysis stage is to gain deeper insight into how speech attributes such as content, identity, and pitch are encoded. The synthesis stage then trains an AnCoGen network for post-hoc explanation of codecs, extracting speech attributes directly from the respective tokens.
Samir Sadok, Julien Hauret, Éric Bavu
Communication
Samir Sadok, Julien Hauret, Éric Bavu. Bringing Interpretability to Neural Audio Codecs [EB/OL]. (2025-06-04) [2025-06-16]. https://arxiv.org/abs/2506.04492.