MedGUIDE: Benchmarking Clinical Decision-Making in Large Language Models

Source: arXiv
Abstract

Clinical guidelines, typically structured as decision trees, are central to evidence-based medical practice and critical for ensuring safe and accurate diagnostic decision-making. However, it remains unclear whether Large Language Models (LLMs) can reliably follow such structured protocols. In this work, we introduce MedGUIDE, a new benchmark for evaluating LLMs on their ability to make guideline-consistent clinical decisions. MedGUIDE is constructed from 55 curated NCCN decision trees across 17 cancer types and uses clinical scenarios generated by LLMs to create a large pool of multiple-choice diagnostic questions. We apply a two-stage quality selection process, combining expert-labeled reward models and LLM-as-a-judge ensembles across ten clinical and linguistic criteria, to select 7,747 high-quality samples. We evaluate 25 LLMs spanning general-purpose, open-source, and medically specialized models, and find that even domain-specific LLMs often underperform on tasks requiring structured guideline adherence. We also test whether performance can be improved via in-context guideline inclusion or continued pretraining. Our findings underscore the importance of MedGUIDE in assessing whether LLMs can operate safely within the procedural frameworks expected in real-world clinical settings.

Xiaomin Li, Mingye Gao, Yuexing Hao, Taoran Li, Guangya Wan, Zihan Wang, Yijun Wang

Medical research methods; Oncology; Clinical medicine

Xiaomin Li, Mingye Gao, Yuexing Hao, Taoran Li, Guangya Wan, Zihan Wang, Yijun Wang. MedGUIDE: Benchmarking Clinical Decision-Making in Large Language Models [EB/OL]. (2025-05-16) [2025-07-01]. https://arxiv.org/abs/2505.11613.
