
GuideBench: Benchmarking Domain-Oriented Guideline Following for LLM Agents

Source: arXiv
Abstract

Large language models (LLMs) have been widely deployed as autonomous agents capable of following user instructions and making decisions in real-world applications. Previous studies have made notable progress in benchmarking the instruction-following capabilities of LLMs in general domains, with a primary focus on their inherent commonsense knowledge. More recently, LLMs have increasingly been deployed as domain-oriented agents, which rely on domain-oriented guidelines that may conflict with that commonsense knowledge. These guidelines exhibit two key characteristics: they consist of a wide range of domain-oriented rules, and they are subject to frequent updates. Despite these challenges, there is no comprehensive benchmark for evaluating the domain-oriented guideline-following capabilities of LLMs, which presents a significant obstacle to their effective assessment and further development. In this paper, we introduce GuideBench, a comprehensive benchmark designed to evaluate the guideline-following performance of LLMs. GuideBench evaluates LLMs on three critical aspects: (i) adherence to diverse rules, (ii) robustness to rule updates, and (iii) alignment with human preferences. Experimental results on a range of LLMs indicate substantial opportunities for improving their ability to follow domain-oriented guidelines.

Lingxiao Diao, Xinyue Xu, Wanxuan Sun, Cheng Yang, Zhuosheng Zhang

Computing Technology, Computer Technology

Lingxiao Diao, Xinyue Xu, Wanxuan Sun, Cheng Yang, Zhuosheng Zhang. GuideBench: Benchmarking Domain-Oriented Guideline Following for LLM Agents[EB/OL]. (2025-05-16)[2025-07-16]. https://arxiv.org/abs/2505.11368.
