Evaluating Text Style Transfer: A Nine-Language Benchmark for Text Detoxification
Despite recent progress in large language models (LLMs), evaluating text generation tasks such as text style transfer (TST) remains a significant challenge. Recent studies (Dementieva et al., 2024; Pauli et al., 2025) revealed a substantial gap between automatic metrics and human judgments. Moreover, most prior work focuses exclusively on English, leaving multilingual TST evaluation largely unexplored. In this paper, we perform the first comprehensive multilingual study on the evaluation of text detoxification systems across nine languages: English, Spanish, German, Chinese, Arabic, Hindi, Ukrainian, Russian, and Amharic. Drawing inspiration from machine translation, we assess the effectiveness of modern neural evaluation models alongside prompting-based LLM-as-a-judge approaches. Our findings provide a practical recipe for designing more reliable multilingual TST evaluation pipelines, with text detoxification as the case study.
Vitaly Protasov, Nikolay Babakov, Daryna Dementieva, Alexander Panchenko
Indo-European languages; Sino-Tibetan languages; Semito-Hamitic (Afro-Asiatic) languages; Austroasiatic languages; African languages
Vitaly Protasov, Nikolay Babakov, Daryna Dementieva, Alexander Panchenko. Evaluating Text Style Transfer: A Nine-Language Benchmark for Text Detoxification [EB/OL]. (2025-07-21) [2025-08-10]. https://arxiv.org/abs/2507.15557.