LLM-BABYBENCH: Understanding and Evaluating Grounded Planning and Reasoning in LLMs

Source: arXiv
Abstract

Assessing the capacity of Large Language Models (LLMs) to plan and reason within the constraints of interactive environments is crucial for developing capable AI agents. We introduce LLM-BabyBench, a new benchmark suite designed specifically for this purpose. Built upon a textual adaptation of the procedurally generated BabyAI grid world, this suite evaluates LLMs on three fundamental aspects of grounded intelligence: (1) predicting the consequences of actions on the environment state (Predict task), (2) generating sequences of low-level actions to achieve specified objectives (Plan task), and (3) decomposing high-level instructions into coherent subgoal sequences (Decompose task). We detail the methodology for generating the three corresponding datasets (LLM-BabyBench-Predict, -Plan, -Decompose) by extracting structured information from an expert agent operating within the text-based environment. Furthermore, we provide a standardized evaluation harness and metrics, including environment interaction for validating generated plans, to facilitate reproducible assessment of diverse LLMs. Initial baseline results highlight the challenges posed by these grounded reasoning tasks. The benchmark suite, datasets, data generation code, and evaluation code are made publicly available (GitHub: https://github.com/choukrani/llm-babybench, HuggingFace: https://huggingface.co/datasets/salem-mbzuai/LLM-BabyBench).
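
To make the Predict and Plan tasks concrete, the following is a minimal toy sketch, not the benchmark's actual environment or evaluation harness. It assumes a bare grid with walls, an agent that has a position and a heading, and three actions (`left`, `right`, `forward`). Every name in it (`State`, `step`, `predict`, `plan_succeeds`) is hypothetical and chosen only for illustration. `predict` answers the Predict-style question "what state results from these actions?", while `plan_succeeds` mimics validating a generated plan by running it in the environment.

```python
from dataclasses import dataclass

# Heading convention assumed for this toy example only:
# 0 = east, 1 = south, 2 = west, 3 = north, with y growing downward.
DIRS = [(1, 0), (0, 1), (-1, 0), (0, -1)]


@dataclass(frozen=True)
class State:
    """Hypothetical agent state: grid position plus heading index."""
    x: int
    y: int
    heading: int


def step(state, action, walls, width, height):
    """Apply one low-level action and return the resulting state."""
    if action == "left":
        return State(state.x, state.y, (state.heading - 1) % 4)
    if action == "right":
        return State(state.x, state.y, (state.heading + 1) % 4)
    if action == "forward":
        dx, dy = DIRS[state.heading]
        nx, ny = state.x + dx, state.y + dy
        in_bounds = 0 <= nx < width and 0 <= ny < height
        if not in_bounds or (nx, ny) in walls:
            return state  # blocked moves leave the state unchanged
        return State(nx, ny, state.heading)
    raise ValueError(f"unknown action: {action!r}")


def predict(state, actions, walls, width, height):
    """Predict-style query: final state after a sequence of actions."""
    for action in actions:
        state = step(state, action, walls, width, height)
    return state


def plan_succeeds(start, actions, goal, walls, width, height):
    """Plan-style check: execute the plan and test whether the goal cell is reached."""
    final = predict(start, actions, walls, width, height)
    return (final.x, final.y) == goal


# Toy 3x3 grid with one wall directly east of the start cell.
start = State(0, 0, 0)
walls = {(1, 0)}
detour = ["right", "forward", "left", "forward", "forward"]
final_state = predict(start, detour, walls, 3, 3)
detour_ok = plan_succeeds(start, detour, (2, 1), walls, 3, 3)
naive_ok = plan_succeeds(start, ["forward", "forward"], (2, 1), walls, 3, 3)
```

In this sketch the detour plan ends in the goal cell, whereas the naive plan is stopped by the wall and fails. Checking a plan by execution rather than by comparing it to a reference action sequence means that any plan reaching the goal counts as correct, which is the reason the abstract gives for including environment interaction in the evaluation.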

Omar Choukrani, Idriss Malek, Daniil Orel, Zhuohan Xie, Zangir Iklassov, Martin Takáč, Salem Lahlou

Subject: Computing technology, computer technology

Omar Choukrani, Idriss Malek, Daniil Orel, Zhuohan Xie, Zangir Iklassov, Martin Takáč, Salem Lahlou. LLM-BABYBENCH: Understanding and Evaluating Grounded Planning and Reasoning in LLMs [EB/OL]. (2025-05-17) [2025-06-13]. https://arxiv.org/abs/2505.12135.