LLMs Are Not Scorers: Rethinking MT Evaluation with Generation-Based Methods
Recent studies have applied large language models (LLMs) to machine translation quality estimation (MTQE) by prompting models to assign numeric scores. However, these direct scoring methods tend to show low segment-level correlation with human judgments. In this paper, we propose a generation-based evaluation paradigm that leverages decoder-only LLMs to produce high-quality references, followed by semantic similarity scoring using sentence embeddings. We conduct the most extensive evaluation of this kind in MTQE to date, covering 8 LLMs and 8 language pairs. Empirical results show that our method outperforms both intra-LLM direct scoring baselines and external non-LLM reference-free metrics from MTME. These findings demonstrate the strength of generation-based evaluation and support a shift toward hybrid approaches that combine fluent generation with accurate semantic assessment.
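To make the described pipeline concrete, the sketch below illustrates the generation-then-similarity idea: a decoder-only LLM first produces a reference translation for the source segment, and a sentence-embedding model then scores the MT hypothesis by its semantic similarity to that generated reference. This is a minimal sketch under stated assumptions, not the paper's exact setup: the LaBSE embedding model, the cosine-similarity scoring, and the `generate_reference` helper are all illustrative choices.

```python
# Minimal sketch of generation-based MT quality estimation:
# (1) prompt an LLM to produce a reference translation,
# (2) score the MT hypothesis by embedding similarity to that reference.
# Model names and the cosine-similarity choice are assumptions for illustration.
from sentence_transformers import SentenceTransformer, util

# Assumed embedding model; any multilingual sentence encoder could be substituted.
embedder = SentenceTransformer("sentence-transformers/LaBSE")

def generation_based_qe(source: str, hypothesis: str, generate_reference) -> float:
    """Estimate translation quality without a human reference.

    `generate_reference` is a hypothetical callable that prompts a
    decoder-only LLM to translate `source` into the target language.
    """
    # Step 1: produce a pseudo-reference with the LLM.
    reference = generate_reference(source)

    # Step 2: embed the pseudo-reference and the MT hypothesis,
    # then use cosine similarity as the segment-level quality score.
    emb = embedder.encode([reference, hypothesis], convert_to_tensor=True)
    return util.cos_sim(emb[0], emb[1]).item()
```

In this sketch the score is bounded by the embedding model's similarity range; in practice, scores would be correlated against human judgments at the segment level to compare with direct-scoring baselines.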
Hyang Cui
Computing Technology; Computer Technology
Hyang Cui. LLMs Are Not Scorers: Rethinking MT Evaluation with Generation-Based Methods [EB/OL]. (2025-05-21) [2025-06-15]. https://arxiv.org/abs/2505.16129.