Everything You Wanted to Know About LLM-based Vulnerability Detection But Were Afraid to Ask
Large Language Models (LLMs) are a promising tool for automated vulnerability detection, thanks to their success in code generation and repair. Yet despite widespread adoption, a critical question remains: are LLMs truly effective at detecting real-world vulnerabilities? Current evaluations often assess models on isolated functions or files, and so ignore the broader execution and data-flow context essential for understanding vulnerabilities. This oversight leads to two types of misleading outcomes, incorrect conclusions and flawed rationales, which together undermine the reliability of prior assessments. In this paper, we challenge three widely held community beliefs: that LLMs are (i) unreliable, (ii) insensitive to code patches, and (iii) plateaued in performance across model scales. We argue that these beliefs are artifacts of context-deprived evaluations. To address this, we propose CORRECT (Context-Rich Reasoning Evaluation of Code with Trust), a new evaluation framework that systematically incorporates contextual information into LLM-based vulnerability detection. We construct a context-rich dataset of 2,000 vulnerable-patched program pairs spanning 99 CWEs and evaluate 13 LLMs across four model families. Our framework elicits both binary predictions and natural-language rationales, which are further validated using LLM-as-a-judge techniques. Our findings overturn existing misconceptions. When provided with sufficient context, state-of-the-art LLMs achieve significantly improved performance (e.g., an F1-score of 0.7 on key CWEs), with a precision of 0.8. We show that most false positives stem from reasoning errors rather than misclassification, and that while model scaling and test-time scaling improve performance, they bring diminishing returns and trade-offs in recall. Finally, we uncover new flaws in current LLM-based detection systems, such as limited generalization and overthinking biases.
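As a rough illustration of how binary predictions on vulnerable-patched pairs can be turned into the precision, recall, and F1 figures quoted above, the following minimal sketch may help. It is not the CORRECT implementation: the `PairResult` type, the `score_pairs` function, and the counting convention (a flag on the vulnerable version is a true positive, a flag on the patched version is a false positive) are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class PairResult:
    """Hypothetical record of a model's verdicts on one vulnerable-patched pair."""
    flagged_vulnerable: bool  # did the model flag the vulnerable version?
    flagged_patched: bool     # did the model flag the patched version?


def score_pairs(results):
    """Compute precision, recall, and F1 over a list of PairResult objects.

    Assumed convention: flagging the vulnerable version is a true positive,
    missing it is a false negative, flagging the patched version is a
    false positive.
    """
    tp = sum(r.flagged_vulnerable for r in results)
    fn = sum(not r.flagged_vulnerable for r in results)
    fp = sum(r.flagged_patched for r in results)

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


if __name__ == "__main__":
    demo = [
        PairResult(True, False),   # correct on both versions
        PairResult(True, True),    # insensitive to the patch
        PairResult(False, False),  # missed the vulnerability
        PairResult(True, False),
    ]
    print(score_pairs(demo))
```

Scoring both halves of each pair keeps a model that flags everything from looking strong, since every flag on a patched version counts against precision.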
Yue Li, Xiao Li, Hao Wu, Minghui Xu, Yue Zhang, Xiuzhen Cheng, Fengyuan Xu, Sheng Zhong
Computing technology, computer technology
Yue Li, Xiao Li, Hao Wu, Minghui Xu, Yue Zhang, Xiuzhen Cheng, Fengyuan Xu, Sheng Zhong. Everything You Wanted to Know About LLM-based Vulnerability Detection But Were Afraid to Ask [EB/OL]. (2025-04-18) [2025-04-28]. https://arxiv.org/abs/2504.13474.