Large Language Models are Qualified Benchmark Builders: Rebuilding Pre-Training Datasets for Advancing Code Intelligence Tasks

Source: arXiv
Abstract

Pre-trained code models rely heavily on high-quality pre-training data, particularly human-written reference comments that bridge code and natural language. However, these comments often become outdated as software evolves, degrading model performance. Large language models (LLMs) excel at generating high-quality code comments. We investigate whether replacing human-written comments with LLM-generated ones improves pre-training datasets. Since standard metrics cannot assess reference comment quality, we propose two novel reference-free evaluation tasks: code-comment inconsistency detection and semantic code search. Results show that LLM-generated comments are more semantically consistent with code than human-written ones, as confirmed by manual evaluation. Leveraging this finding, we rebuild the CodeSearchNet dataset with LLM-generated comments and re-pre-train CodeT5. Evaluations demonstrate that models trained on LLM-enhanced data outperform those using original human comments in code summarization, generation, and translation tasks. This work validates rebuilding pre-training datasets with LLMs to advance code intelligence, challenging the traditional reliance on human reference comments.
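The pipeline the abstract describes, regenerating each reference comment with an LLM and then re-pre-training on the rebuilt corpus, can be pictured with a short sketch. The code below is illustrative, not the authors' implementation: `llm_summarize` is a hypothetical stand-in for whatever model and prompt they used, and the `code`/`docstring` field names follow the public CodeSearchNet JSONL release.

```python
import json

def llm_summarize(code: str) -> str:
    # Hypothetical stand-in for the LLM call that generates a concise
    # natural-language summary of `code`; the abstract does not specify
    # the model or prompt, so plug in your own client here.
    return "LLM-generated summary placeholder"

def rebuild_dataset(src_path: str, dst_path: str) -> None:
    # Walk a CodeSearchNet-style JSONL file and swap each human-written
    # reference comment for an LLM-generated one, leaving the code intact.
    with open(src_path) as src, open(dst_path, "w") as dst:
        for line in src:
            example = json.loads(line)
            example["docstring"] = llm_summarize(example["code"])
            dst.write(json.dumps(example) + "\n")
```

The semantic code search task then works as a reference-free quality probe: treat each comment as a query against the full pool of code snippets and score how highly its own snippet ranks. A minimal sketch of that scoring, assuming comments and code have already been embedded as unit-normalized vectors (the abstract does not name an embedding model):

```python
import numpy as np

def retrieval_mrr(comment_vecs: np.ndarray, code_vecs: np.ndarray) -> float:
    # Row i of each matrix corresponds to the same comment/code pair, and
    # rows are unit-normalized, so the dot product is cosine similarity.
    sims = comment_vecs @ code_vecs.T
    # Rank of the gold snippet for each query: how many snippets score at
    # least as high as the matching one (the match itself counts, so rank >= 1).
    ranks = (sims >= sims.diagonal()[:, None]).sum(axis=1)
    return float((1.0 / ranks).mean())
```

Under this probe, a comment set that yields a higher mean reciprocal rank is by construction more semantically aligned with its code, which is the sense in which the abstract reports LLM-generated comments outperforming human-written ones.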

Tanghaoran Zhang, Bo Lin, Kamal Al-Sabahi, Zhang Zhang, Yihao Qin, Yao Lu, Kang Yang, Xinjun Mao, Shangwen Wang, Yanlin Wang

Subject: Computing Technology, Computer Technology

Tanghaoran Zhang, Bo Lin, Kamal Al-Sabahi, Zhang Zhang, Yihao Qin, Yao Lu, Kang Yang, Xinjun Mao, Shangwen Wang, Yanlin Wang. Large Language Models are Qualified Benchmark Builders: Rebuilding Pre-Training Datasets for Advancing Code Intelligence Tasks [EB/OL]. (2025-04-27) [2025-05-06]. https://arxiv.org/abs/2504.19444.
