A Large-Scale Benchmark for Evaluating Large Language Models on Medical Question Answering in Romanian
Ana-Cristina Rogoz, Radu Tudor Ionescu, Alexandra-Valentina Anghel, Ionut-Lucian Antone-Iordache, Simona Coniac, Andreea Iuliana Ionescu
Abstract
We introduce MedQARo, the first large-scale benchmark for medical question answering (QA) in Romanian, alongside a comprehensive evaluation of state-of-the-art large language models (LLMs). We construct a high-quality, large-scale dataset comprising 105,880 QA pairs about cancer patients from two medical centers. The questions are based on the medical case summaries of 1,242 patients and require both keyword extraction and reasoning. Our benchmark contains both in-domain and cross-domain (cross-center and cross-cancer) test collections, enabling a precise assessment of generalization capabilities. We experiment with four open-source LLMs from distinct model families on MedQARo, employing each model in two scenarios: zero-shot prompting and supervised fine-tuning. We also evaluate two state-of-the-art LLMs exposed only through APIs, namely GPT-5.2 and Gemini 3 Flash. Our results show that fine-tuned models significantly outperform their zero-shot counterparts, indicating that pretrained models fail to generalize on MedQARo out of the box. These findings demonstrate the importance of both domain-specific and language-specific fine-tuning for reliable clinical QA in Romanian.
Citation: Ana-Cristina Rogoz, Radu Tudor Ionescu, Alexandra-Valentina Anghel, Ionut-Lucian Antone-Iordache, Simona Coniac, Andreea Iuliana Ionescu. A Large-Scale Benchmark for Evaluating Large Language Models on Medical Question Answering in Romanian [EB/OL]. (2026-02-12) [2026-03-05]. https://arxiv.org/abs/2508.16390.
Subject classification: Theory of medicine and health / Medical research methods / Common foreign languages