S2SBench: A Benchmark for Quantifying Intelligence Degradation in Speech-to-Speech Large Language Models

Abstract

End-to-end speech large language models (LLMs) extend the capabilities of text-based models to directly process and generate audio tokens. However, this often leads to a decline in reasoning and generation performance compared to text input, a phenomenon referred to as intelligence degradation. To systematically evaluate this gap, we propose S2SBench, a benchmark designed to quantify performance degradation in speech LLMs. It includes diagnostic datasets targeting sentence continuation and commonsense reasoning under audio input. We further introduce a pairwise evaluation protocol based on perplexity differences between plausible and implausible samples to measure degradation relative to text input. We apply S2SBench to analyze the training process of Baichuan-Audio, which further demonstrates the benchmark's effectiveness. All datasets and evaluation code are available at https://github.com/undobug/S2SBench.
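The abstract names a pairwise, perplexity-based protocol but does not spell out the scoring rule. The following is a minimal sketch under stated assumptions, not the authors' implementation: a pair counts as correct when the plausible sample gets lower perplexity than the implausible one, and degradation is the drop in pairwise accuracy from text input to audio input. The function names and the toy log-probabilities are hypothetical; the actual evaluation code is in the S2SBench repository.

```python
import math
from typing import Iterable, Sequence, Tuple

# A pair holds per-token natural-log probabilities for
# (plausible sample, implausible sample). Layout is an assumption.
Pair = Tuple[Sequence[float], Sequence[float]]


def perplexity(token_logprobs: Sequence[float]) -> float:
    """Perplexity = exp of the mean negative log-likelihood per token."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def pairwise_accuracy(pairs: Iterable[Pair]) -> float:
    """Fraction of pairs where the plausible sample has lower perplexity."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("need at least one pair")
    correct = sum(perplexity(pos) < perplexity(neg) for pos, neg in pairs)
    return correct / len(pairs)


def degradation(text_pairs: Iterable[Pair], audio_pairs: Iterable[Pair]) -> float:
    """Assumed degradation measure: text accuracy minus audio accuracy."""
    return pairwise_accuracy(text_pairs) - pairwise_accuracy(audio_pairs)


# Toy numbers for illustration only; they are not results from the paper.
text_pairs = [([-0.1, -0.2], [-1.0, -1.5]), ([-0.3, -0.3], [-0.9, -0.8])]
audio_pairs = [([-0.4, -0.5], [-0.9, -1.0]), ([-1.2, -1.1], [-0.7, -0.6])]

print(pairwise_accuracy(text_pairs))
print(pairwise_accuracy(audio_pairs))
print(degradation(text_pairs, audio_pairs))
```

Comparing perplexities within a pair, rather than reporting raw perplexity, keeps the score comparable across modalities whose token vocabularies and sequence lengths differ.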

Authors: Yuanbo Fang, Haoze Sun, Jun Liu, Tao Zhang, Zenan Zhou, Weipeng Chen, Xiaofen Xing, Xiangmin Xu

Subject classification: Computing Technology, Computer Technology

Recommended citation: Yuanbo Fang, Haoze Sun, Jun Liu, Tao Zhang, Zenan Zhou, Weipeng Chen, Xiaofen Xing, Xiangmin Xu. S2SBench: A Benchmark for Quantifying Intelligence Degradation in Speech-to-Speech Large Language Models [EB/OL]. (2025-05-20) [2025-07-16]. https://arxiv.org/abs/2505.14438.
