A Survey of Automatic Hallucination Evaluation on Natural Language Generation
The proliferation of Large Language Models (LLMs) has introduced a critical challenge: accurate hallucination evaluation that ensures model reliability. While Automatic Hallucination Evaluation (AHE) has emerged as essential, the field suffers from methodological fragmentation, hindering both theoretical understanding and practical advancement. This survey addresses this critical gap through a comprehensive analysis of 74 evaluation methods, revealing that 74% specifically target LLMs, a paradigm shift that demands new evaluation frameworks. We formulate a unified evaluation pipeline encompassing datasets and benchmarks, evidence collection strategies, and comparison mechanisms, systematically documenting the evolution from pre-LLM to post-LLM methodologies. Beyond taxonomical organization, we identify fundamental limitations in current approaches and their implications for real-world deployment. To guide future research, we delineate key challenges and propose strategic directions, including enhanced interpretability mechanisms and integration of application-specific evaluation criteria, ultimately providing a roadmap for developing more robust and practical hallucination evaluation systems.
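To make the abstract's three-stage pipeline concrete, below is a minimal illustrative sketch of how datasets and benchmarks, evidence collection, and comparison could fit together. The survey does not prescribe an implementation; all class and function names here (`Example`, `collect_evidence`, `compare`, `evaluate`) are hypothetical, and the token-overlap scorer is a deliberately simple stand-in for the entailment- or QA-based methods the survey actually catalogs.

```python
# Illustrative sketch only: the survey names three pipeline stages
# (datasets/benchmarks, evidence collection, comparison) but does not
# prescribe an implementation. All names below are hypothetical.
from dataclasses import dataclass
from typing import List


@dataclass
class Example:
    """One benchmark item: a generated text plus its source/reference."""
    generated: str
    reference: str


def collect_evidence(example: Example) -> List[str]:
    # Hypothetical evidence-collection step: naively split the reference
    # into sentences; real methods retrieve supporting passages or
    # decompose outputs into atomic claims.
    return [s.strip() for s in example.reference.split(".") if s.strip()]


def compare(generated: str, evidence: List[str]) -> float:
    # Hypothetical comparison step: token-overlap ratio as a stand-in
    # for entailment- or QA-based consistency scoring.
    gen_tokens = set(generated.lower().split())
    ev_tokens = set(" ".join(evidence).lower().split())
    return len(gen_tokens & ev_tokens) / max(len(gen_tokens), 1)


def evaluate(dataset: List[Example]) -> float:
    """Run the three-stage pipeline and return a mean consistency score."""
    scores = [compare(ex.generated, collect_evidence(ex)) for ex in dataset]
    return sum(scores) / max(len(scores), 1)


if __name__ == "__main__":
    demo = [Example(generated="The model was trained on books.",
                    reference="The model was trained on books. It has many parameters.")]
    print(f"mean score: {evaluate(demo):.2f}")
```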
Siya Qi, Lin Gui, Yulan He, Zheng Yuan
Computing Technology, Computer Technology
Siya Qi, Lin Gui, Yulan He, Zheng Yuan. A Survey of Automatic Hallucination Evaluation on Natural Language Generation [EB/OL]. (2025-06-19) [2025-06-29]. https://arxiv.org/abs/2404.12041.