100-LongBench: Are de facto Long-Context Benchmarks Literally Evaluating Long-Context Ability?
Long-context capability is considered one of the most important abilities of LLMs, as a truly long-context-capable LLM lets users offload tasks that would otherwise be exhausting -- e.g., instead of digesting a long-form document to find an answer, the user can simply ask the LLM about it. However, existing real-task-based long-context evaluation benchmarks have two major shortcomings. First, benchmarks like LongBench often do not provide metrics that separate long-context performance from the model's baseline ability, making cross-model comparison unclear. Second, such benchmarks are usually constructed with fixed input lengths, which limits their applicability across different models and fails to reveal when a model begins to break down. To address these issues, we introduce a length-controllable long-context benchmark and a novel metric that disentangles baseline knowledge from true long-context capabilities. Experiments demonstrate the superiority of our approach in effectively evaluating LLMs.
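For intuition, the sketch below shows one common way to report long-context ability relative to a short-context baseline: accuracy retention across context lengths. This is an illustration of the general idea only; the function names, the normalization, and the example numbers are assumptions, not the metric or data proposed in the paper.

# Illustrative sketch only (assumed, not the paper's metric): normalize accuracy
# at each context length by accuracy at a short "baseline" length, so that
# long-context degradation can be compared across models with different baselines.

def length_retention(baseline_acc: float, long_context_acc: float) -> float:
    """Fraction of baseline (short-context) accuracy retained at a longer length."""
    if baseline_acc <= 0.0:
        raise ValueError("Baseline accuracy must be positive to normalize against it.")
    return long_context_acc / baseline_acc


def retention_curve(acc_by_length: dict[int, float], baseline_length: int) -> dict[int, float]:
    """Map each evaluated context length to retention relative to the baseline length."""
    baseline = acc_by_length[baseline_length]
    return {length: length_retention(baseline, acc) for length, acc in acc_by_length.items()}


if __name__ == "__main__":
    # Hypothetical per-length accuracies for one model on a length-controllable task.
    accs = {4_000: 0.82, 16_000: 0.74, 64_000: 0.51, 128_000: 0.33}
    for length, retained in retention_curve(accs, baseline_length=4_000).items():
        print(f"{length:>7} tokens: {retained:.2f} of baseline accuracy retained")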
Wang Yang, Hongye Jin, Shaochen Zhong, Song Jiang, Qifan Wang, Vipin Chaudhary, Xiaotian Han
Computing Technology; Computer Technology
Wang Yang, Hongye Jin, Shaochen Zhong, Song Jiang, Qifan Wang, Vipin Chaudhary, Xiaotian Han. 100-LongBench: Are de facto Long-Context Benchmarks Literally Evaluating Long-Context Ability? [EB/OL]. (2025-05-25) [2025-07-24]. https://arxiv.org/abs/2505.19293.