
Beyond Surface Similarity: Evaluating LLM-Based Test Refactorings with Structural and Semantic Awareness

Source: arXiv

Abstract

Large Language Models (LLMs) are increasingly employed to automatically refactor unit tests, aiming to enhance readability, naming, and structural clarity while preserving functional behavior. However, evaluating such refactorings remains challenging: traditional metrics like CodeBLEU are overly sensitive to renaming and structural edits, whereas embedding-based similarities capture semantics but ignore readability and modularity. We introduce CTSES, a composite metric that integrates CodeBLEU, METEOR, and ROUGE-L to balance behavior preservation, lexical quality, and structural alignment. CTSES is evaluated on over 5,000 test suites automatically refactored by GPT-4o and Mistral-Large-2407, using Chain-of-Thought prompting, across two established Java benchmarks: Defects4J and SF110. Our results show that CTSES yields more faithful and interpretable assessments, better aligned with developer expectations and human intuition than existing metrics.
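
The abstract does not spell out how CTSES aggregates its three components, so the sketch below is only a minimal illustration of one plausible composition: it assumes each component score is normalized to [0, 1] and combined as a weighted mean with equal weights (both assumptions, not the paper's definition). The names `ctses_score` and `rouge_l_f1` are hypothetical; ROUGE-L is implemented here via the standard longest-common-subsequence F1, while CodeBLEU and METEOR values are taken as precomputed inputs from their reference tools.

```python
def rouge_l_f1(reference: str, candidate: str) -> float:
    """ROUGE-L F1 from the longest common subsequence of whitespace tokens."""
    ref, cand = reference.split(), candidate.split()
    if not ref or not cand:
        return 0.0
    # Dynamic-programming table for LCS length.
    dp = [[0] * (len(cand) + 1) for _ in range(len(ref) + 1)]
    for i, r in enumerate(ref, start=1):
        for j, c in enumerate(cand, start=1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if r == c else max(dp[i - 1][j], dp[i][j - 1])
    lcs = dp[len(ref)][len(cand)]
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


def ctses_score(codebleu: float, meteor: float, rouge_l: float,
                weights: tuple[float, float, float] = (1 / 3, 1 / 3, 1 / 3)) -> float:
    """Weighted combination of the three component metrics (equal weights assumed)."""
    w_cb, w_me, w_rl = weights
    return w_cb * codebleu + w_me * meteor + w_rl * rouge_l


# Example: score a refactored assertion against the original, with
# hypothetical precomputed CodeBLEU and METEOR values.
original = "assertEquals ( expected , result )"
refactored = "assertEquals ( expected , actualResult )"
print(ctses_score(codebleu=0.72, meteor=0.65,
                  rouge_l=rouge_l_f1(original, refactored)))
```

The equal-weight mean is just a placeholder design choice; the point of a composite like CTSES is that the lexical (METEOR), structural (ROUGE-L), and behavior-oriented (CodeBLEU) signals each contribute, so renaming-heavy refactorings are not penalized by any single surface metric alone.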

Wendkûuni C. Ouédraogo, Yinghua Li, Xueqi Dang, Xin Zhou, Anil Koyuncu, Jacques Klein, David Lo, Tegawendé F. Bissyandé

Subject: computing technology, computer technology

Wendkûuni C. Ouédraogo, Yinghua Li, Xueqi Dang, Xin Zhou, Anil Koyuncu, Jacques Klein, David Lo, Tegawendé F. Bissyandé. Beyond Surface Similarity: Evaluating LLM-Based Test Refactorings with Structural and Semantic Awareness[EB/OL]. (2025-06-07)[2025-06-17]. https://arxiv.org/abs/2506.06767.
