CLEAR: Error Analysis via LLM-as-a-Judge Made Easy
The evaluation of Large Language Models (LLMs) increasingly relies on other LLMs acting as judges. However, current evaluation paradigms typically yield a single score or ranking, answering which model is better but not why. While essential for benchmarking, these top-level scores obscure the specific, actionable reasons behind a model's performance. To bridge this gap, we introduce CLEAR, an interactive, open-source package for LLM-based error analysis. CLEAR first generates per-instance textual feedback, then aggregates it into a set of system-level error issues, and quantifies the prevalence of each identified issue. Our package also provides users with an interactive dashboard that allows for comprehensive error analysis through aggregate visualizations, interactive filters for isolating specific issues or score ranges, and drill-down to the individual instances that exemplify a particular behavioral pattern. We demonstrate CLEAR's analysis on RAG and Math benchmarks, and showcase its utility through a user case study.
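To make the described pipeline concrete, the following is a minimal illustrative sketch in Python. It is not the CLEAR package's actual API: the judge_feedback and map_to_issue helpers are hypothetical stand-ins for the LLM-judge calls that produce per-instance feedback and derive system-level issues; only the three-step flow (feedback, issue aggregation, prevalence counting) mirrors the abstract.

```python
from collections import Counter

# Illustrative sketch only; NOT the CLEAR package's API.
# judge_feedback and map_to_issue are hypothetical stubs standing in for
# the LLM-as-a-judge steps described in the abstract.

def judge_feedback(question: str, answer: str) -> str:
    """Hypothetical per-instance judge: returns free-text feedback on one answer."""
    # In CLEAR this step is performed by an LLM judge; here it is hard-coded.
    return "Answer ignores the retrieved context." if "context" not in answer else "Looks fine."

def map_to_issue(feedback: str) -> str:
    """Hypothetical mapping of free-text feedback to a system-level issue label."""
    # CLEAR derives the issue set from the feedback itself; here it is hard-coded.
    return "Ignores retrieved context" if "ignores" in feedback.lower() else "No issue"

def analyze(dataset: list[dict]) -> Counter:
    """Three illustrative steps: per-instance feedback -> issue labels -> prevalence."""
    feedback = [judge_feedback(ex["question"], ex["answer"]) for ex in dataset]
    issues = [map_to_issue(fb) for fb in feedback]
    return Counter(issues)  # prevalence of each identified issue

if __name__ == "__main__":
    data = [
        {"question": "Q1", "answer": "A short, unsupported answer."},
        {"question": "Q2", "answer": "An answer grounded in the retrieved context."},
    ]
    print(analyze(data))  # e.g. Counter({'Ignores retrieved context': 1, 'No issue': 1})
```

In the actual package, the resulting issue prevalences feed the interactive dashboard, which additionally supports filtering by issue or score range and drilling down to individual instances.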
Asaf Yehudai, Lilach Eden, Yotam Perlitz, Roy Bar-Haim, Michal Shmueli-Scheuer
Computing Technology, Computer Technology
Asaf Yehudai, Lilach Eden, Yotam Perlitz, Roy Bar-Haim, Michal Shmueli-Scheuer. CLEAR: Error Analysis via LLM-as-a-Judge Made Easy [EB/OL]. (2025-07-24) [2025-08-10]. https://arxiv.org/abs/2507.18392.