AblationBench: Evaluating Automated Planning of Ablations in Empirical AI Research
Autonomous agents built on language models (LMs) are gaining popularity in many fields, including scientific research. AI co-scientists aim to support or automate parts of the research process using these agents. A key component of empirical AI research is the design of ablation experiments. To this end, we introduce AblationBench, a benchmark suite for evaluating agents on ablation planning tasks in empirical AI research. It includes two tasks: AuthorAblation, which helps authors propose ablation experiments based on a method section and contains 83 instances, and ReviewerAblation, which helps reviewers find missing ablations in a full paper and contains 350 instances. For both tasks, we develop LM-based judges that serve as an automatic evaluation framework. Our experiments with frontier LMs show that these tasks remain challenging, with the best-performing LM system identifying only 29% of the original ablations on average. Finally, we analyze the limitations of current LMs on these tasks and find that chain-of-thought prompting outperforms the existing agent-based approach.
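The headline number ("identifying only 29% of the original ablations") suggests a recall-style score: the fraction of the paper's original ablations that an LM-based judge deems covered by the system's proposals. The sketch below is a hypothetical illustration of that shape only, not AblationBench's actual code: `ablation_recall`, `toy_judge`, and the lexical matching criterion are all assumptions; in the real benchmark the judge is an LM-based component with its own prompts and criteria.

```python
# Hypothetical recall-style metric for ablation planning (assumed shape, not
# the AblationBench implementation): an LM judge decides, for each ablation
# from the original paper, whether any proposed ablation covers it.
from typing import Callable, List


def ablation_recall(
    proposed: List[str],
    reference: List[str],
    judge: Callable[[str, str], bool],
) -> float:
    """Fraction of reference (original-paper) ablations matched by any proposal."""
    if not reference:
        return 0.0
    matched = sum(
        any(judge(ref, prop) for prop in proposed) for ref in reference
    )
    return matched / len(reference)


def toy_judge(ref: str, prop: str) -> bool:
    # Crude lexical-overlap stand-in; the real judge would be an LM call.
    stopwords = {"the", "a", "of"}
    return bool(set(ref.lower().split()) & set(prop.lower().split()) - stopwords)


if __name__ == "__main__":
    proposed = ["remove the attention module", "train without data augmentation"]
    reference = ["ablate attention", "remove data augmentation", "vary model depth"]
    print(f"recall = {ablation_recall(proposed, reference, toy_judge):.2f}")  # 0.67
```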
Talor Abramovich, Gal Chechik
Computing technology; computer technology
Talor Abramovich, Gal Chechik. AblationBench: Evaluating Automated Planning of Ablations in Empirical AI Research [EB/OL]. (2025-07-09) [2025-08-02]. https://arxiv.org/abs/2507.08038.