Empirical Evaluation of Large Language Models in Automated Program Repair
The increasing prevalence of software bugs has made automated program repair (APR) a key research focus. Large language models (LLMs) offer new opportunities for APR, but existing studies mostly rely on smaller, earlier-generation models and Java benchmarks. The repair capabilities of modern, large-scale LLMs across diverse languages and scenarios remain underexplored. To address this, we conduct a comprehensive empirical study of four open-source LLMs, CodeLlama, LLaMA, StarCoder, and DeepSeek-Coder, spanning 7B to 33B parameters and covering diverse architectures and design purposes. We evaluate them across two bug scenarios (enterprise-grade and algorithmic), three languages (Java, C/C++, Python), and four prompting strategies, analyzing over 600K generated patches on six benchmarks. Key findings include: (1) model specialization (e.g., CodeLlama) can outperform larger general-purpose models (e.g., LLaMA); (2) repair performance does not scale linearly with model size; (3) correct patches often appear early in generation; and (4) prompts significantly affect results. These insights offer practical guidance for designing effective and efficient LLM-based APR systems.
Jiajun Sun, Fengjie Li, Xinzhu Qi, Hongyu Zhang, Jiajun Jiang
Computing Technology, Computer Technology
Jiajun Sun, Fengjie Li, Xinzhu Qi, Hongyu Zhang, Jiajun Jiang. Empirical Evaluation of Large Language Models in Automated Program Repair [EB/OL]. (2025-06-16) [2025-08-02]. https://arxiv.org/abs/2506.13186.