CODEMENV: Benchmarking Large Language Models on Code Migration
Large language models (LLMs) have shown remarkable capabilities across various software engineering tasks; however, their effectiveness in code migration, that is, adapting code to run in different environments, remains insufficiently studied. In this work, we introduce CODEMENV: Code Migration Across Environment, a new benchmark specifically designed to assess LLMs' abilities in code migration scenarios. CODEMENV consists of 922 examples spanning 19 Python and Java packages, and covers three core tasks: (1) identifying functions incompatible with specific versions, (2) detecting changes in function definitions, and (3) adapting code to target environments. Experimental evaluation of seven LLMs on CODEMENV yields an average pass@1 rate of 26.50%, with GPT-4o achieving the highest score at 43.84%. Key findings include: (i) LLMs tend to be more proficient with newer function versions, which aids in migrating legacy code, and (ii) LLMs sometimes exhibit logical inconsistencies by identifying function changes irrelevant to the intended migration environment. The datasets are available at https://github.com/xdshen-ai/Benchmark-of-Code-Migration.
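To make the third task concrete, a migration case of the kind CODEMENV targets might look like the sketch below. The specific API change used here (NumPy removing the np.float alias in version 1.24) is a real deprecation chosen by us for illustration, not necessarily an example drawn from the benchmark itself.

# Illustrative (hypothetical) code-migration task: adapt legacy code written
# for an older package version so it runs in a newer target environment.
import numpy as np

# Legacy code (valid on numpy < 1.20, fails on numpy >= 1.24):
#   values = np.array([1, 2, 3], dtype=np.float)

# Migrated code for the target environment (numpy >= 1.24):
# replace the removed `np.float` alias with an explicit dtype.
values = np.array([1, 2, 3], dtype=np.float64)
print(values.mean())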
Keyuan Cheng, Xudong Shen, Yihao Yang, Tengyue Wang, Yang Cao, Muhammad Asif Ali, Hanbin Wang, Lijie Hu, Di Wang
Computing Technology, Computer Technology
Keyuan Cheng, Xudong Shen, Yihao Yang, Tengyue Wang, Yang Cao, Muhammad Asif Ali, Hanbin Wang, Lijie Hu, Di Wang. CODEMENV: Benchmarking Large Language Models on Code Migration [EB/OL]. (2025-06-01) [2025-06-17]. https://arxiv.org/abs/2506.00894.