A Practical Guide for Evaluating LLMs and LLM-Reliant Systems
Recent advances in generative AI have led to remarkable interest in using systems that rely on large language models (LLMs) for practical applications. However, meaningful evaluation of these systems in real-world scenarios comes with a distinct set of challenges, which are not well addressed by the synthetic benchmarks and de facto metrics often seen in the literature. We present a practical evaluation framework that outlines how to proactively curate representative datasets, select meaningful metrics, and employ evaluation methodologies that integrate well with the practical development and deployment of LLM-reliant systems, which must adhere to real-world requirements and meet user-facing needs.
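The framework itself is detailed in the paper; as a purely illustrative sketch (not taken from the paper), the snippet below shows what evaluating an LLM-reliant system against a small curated dataset with a task-specific metric and an operational requirement might look like. All names and thresholds here (`run_system`, `curated_dataset`, the 2-second latency budget) are hypothetical placeholders.

```python
import time
from statistics import mean

# Hypothetical curated dataset of representative (input, reference) pairs.
curated_dataset = [
    {"input": "Summarize: The meeting was moved to Friday.", "reference": "Meeting moved to Friday."},
    {"input": "Summarize: Shipping is delayed by two days.", "reference": "Shipping delayed two days."},
]

def run_system(prompt: str) -> str:
    """Stand-in for the LLM-reliant system under test (replace with a real call)."""
    return prompt.replace("Summarize: ", "")

def exact_match(prediction: str, reference: str) -> float:
    """Toy task-level metric; real deployments need metrics tied to user-facing needs."""
    return float(prediction.strip().lower() == reference.strip().lower())

scores, latencies = [], []
for example in curated_dataset:
    start = time.perf_counter()
    prediction = run_system(example["input"])
    latencies.append(time.perf_counter() - start)
    scores.append(exact_match(prediction, example["reference"]))

# Report quality alongside an operational requirement (the latency budget is an assumption).
print(f"mean score: {mean(scores):.2f}")
print(f"max latency: {max(latencies) * 1000:.1f} ms")
print(f"meets 2s latency budget: {max(latencies) < 2.0}")
```

In practice the placeholder system call, dataset, and metric would be replaced with the deployed pipeline, data drawn from the target use case, and metrics chosen to reflect user-facing success criteria.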
Ethan M. Rudd, Christopher Andrews, Philip Tully
Subjects: Computing Technology; Computer Technology
Ethan M. Rudd, Christopher Andrews, Philip Tully. A Practical Guide for Evaluating LLMs and LLM-Reliant Systems [EB/OL]. (2025-06-15) [2025-06-23]. https://arxiv.org/abs/2506.13023.