国家预印本平台 (National Preprint Platform)

MedQARo: A Large-Scale Benchmark for Evaluating Large Language Models on Medical Question Answering in Romanian

Ana-Cristina Rogoz, Radu Tudor Ionescu, Alexandra-Valentina Anghel, Ionut-Lucian Antone-Iordache, Simona Coniac, Andreea Iuliana Ionescu


Abstract

Question answering (QA) is an actively studied core natural language processing (NLP) task that needs to be addressed before achieving Artificial General Intelligence (AGI). However, the lack of QA datasets in specific domains and languages hinders the development of robust AI models able to generalize across various domains and languages. To this end, we introduce MedQARo, the first large-scale medical QA benchmark in Romanian, alongside a comprehensive evaluation of state-of-the-art (SOTA) large language models (LLMs). We construct a high-quality and large-scale dataset comprising 105,880 QA pairs related to cancer patients from two medical centers. The questions concern medical case summaries of 1,242 patients, requiring either keyword extraction or reasoning to be answered correctly. MedQARo is the result of a time-consuming manual annotation process carried out by seven physicians specialized in oncology or radiotherapy, who spent a total of about 3,000 work hours to generate the QA pairs. Our benchmark contains both in-domain and cross-domain (cross-center and cross-cancer) test collections, enabling a precise assessment of generalization capabilities. We experiment on MedQARo with four open-source LLMs from distinct model families. Each model is employed in two scenarios, namely one based on zero-shot prompting and one based on supervised fine-tuning. We also evaluate two SOTA LLMs exposed only through APIs, namely GPT-5.2 and Gemini 3 Flash. Our results show that fine-tuned models significantly outperform zero-shot models, clearly indicating that pretrained models fail to generalize on MedQARo. Our findings demonstrate the importance of both domain-specific and language-specific fine-tuning for reliable clinical QA in Romanian. We publicly release our dataset and code at https://github.com/ana-rogoz/MedQARo.
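The abstract states that answers require either keyword extraction or reasoning, but does not specify the scoring metrics. For extractive QA, exact match and token-level F1 (as popularized by SQuAD-style evaluation) are the standard choices; the sketch below illustrates such answer scoring under the assumption of simple lowercasing and whitespace tokenization. The function names and normalization scheme are illustrative, not taken from the MedQARo codebase.

```python
# Hedged sketch of extractive-QA answer scoring (exact match and token F1).
# Assumes whitespace tokenization and lowercasing; the paper's actual
# normalization and metrics may differ.
from collections import Counter


def normalize(text: str) -> list[str]:
    """Lowercase and whitespace-tokenize an answer string."""
    return text.lower().split()


def exact_match(prediction: str, gold: str) -> bool:
    """True when the normalized prediction equals the normalized gold answer."""
    return normalize(prediction) == normalize(gold)


def token_f1(prediction: str, gold: str) -> float:
    """Token-overlap F1 between a predicted and a gold answer."""
    pred_tokens, gold_tokens = normalize(prediction), normalize(gold)
    # Multiset intersection counts each shared token at most as often
    # as it appears in both answers.
    common = Counter(pred_tokens) & Counter(gold_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```

For example, scoring the (hypothetical) Romanian prediction "carcinom mamar invaziv" against the gold answer "carcinom mamar" yields a precision of 2/3 and a recall of 1, giving a token F1 of 0.8 while exact match is 0.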

Cite this article

Ana-Cristina Rogoz, Radu Tudor Ionescu, Alexandra-Valentina Anghel, Ionut-Lucian Antone-Iordache, Simona Coniac, Andreea Iuliana Ionescu. MedQARo: A Large-Scale Benchmark for Evaluating Large Language Models on Medical Question Answering in Romanian [EB/OL]. (2025-12-31) [2026-02-08]. https://arxiv.org/abs/2508.16390.

Subject classification

Theory of medicine and health / Current state and development of medicine / Medical research methods / Oncology / Common foreign languages


First published: 2025-12-31