
ToolScan: A Benchmark for Characterizing Errors in Tool-Use LLMs

Source: arXiv
Abstract

Evaluating Large Language Models (LLMs) is one of the most critical aspects of building a performant compound AI system. Since the output from LLMs propagates to downstream steps, identifying LLM errors is crucial to system performance. A common task for LLMs in AI systems is tool use. While several benchmark environments exist for evaluating LLMs on this task, they typically report only a success rate, without any explanation of the failure cases. To solve this problem, we introduce TOOLSCAN, a new benchmark for identifying error patterns in LLM output on tool-use tasks. Our benchmark dataset comprises queries from diverse environments that can be used to test for the presence of seven newly characterized error patterns. Using TOOLSCAN, we show that even the most prominent LLMs exhibit these error patterns in their outputs. Researchers can use these insights from TOOLSCAN to guide their error-mitigation strategies.
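To make the idea of error-pattern characterization concrete, the following is a minimal sketch of how a single tool call from an LLM might be compared against a gold call and assigned a coarse error label. The function name and the error labels here are illustrative assumptions, not TOOLSCAN's actual taxonomy or implementation.

```python
# Hypothetical sketch: label one predicted tool call against a gold call.
# Labels below are illustrative and do not reflect TOOLSCAN's seven patterns.

def classify_tool_call_error(predicted: dict, gold: dict) -> str:
    """Return a coarse error label for a single tool call.

    Each call is a dict like {"name": "search", "args": {"query": "..."}}.
    """
    if predicted.get("name") != gold["name"]:
        return "wrong_tool_selected"
    pred_args, gold_args = predicted.get("args", {}), gold["args"]
    if set(pred_args) != set(gold_args):
        return "missing_or_extra_arguments"
    if pred_args != gold_args:
        return "incorrect_argument_values"
    return "correct"


predicted = {"name": "search", "args": {"query": "weather Paris"}}
gold = {"name": "search", "args": {"query": "weather in Paris"}}
print(classify_tool_call_error(predicted, gold))  # incorrect_argument_values
```

A benchmark in this spirit would aggregate such labels over many queries, yielding a per-pattern error profile rather than a single success rate.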

Akshara Prabhakar, Juntao Tan, Huan Wang, Silvio Savarese, Thai Hoang, Tulika Awalgaonkar, Tian Lan, Liangwei Yang, Jianguo Zhang, Rithesh Murthy, Weiran Yao, Zhiwei Liu, Juan Carlos Niebles, Shelby Heinecke, Caiming Xiong, Shirley Kokane, Ming Zhu, Zuxin Liu

Subject: Computing technology; computer technology

Akshara Prabhakar, Juntao Tan, Huan Wang, Silvio Savarese, Thai Hoang, Tulika Awalgaonkar, Tian Lan, Liangwei Yang, Jianguo Zhang, Rithesh Murthy, Weiran Yao, Zhiwei Liu, Juan Carlos Niebles, Shelby Heinecke, Caiming Xiong, Shirley Kokane, Ming Zhu, Zuxin Liu. ToolScan: A Benchmark for Characterizing Errors in Tool-Use LLMs [EB/OL]. (2025-06-26) [2025-07-16]. https://arxiv.org/abs/2411.13547.
