Fact or Guesswork? Evaluating Large Language Models' Medical Knowledge with Structured One-Hop Judgments

Abstract

Large language models (LLMs) have been widely adopted across a variety of downstream tasks. However, their ability to directly recall and apply factual medical knowledge remains under-explored. Most existing medical QA benchmarks assess complex reasoning or multi-hop inference, making it difficult to isolate LLMs' inherent medical knowledge from their reasoning capabilities. Given the high-stakes nature of medical applications, where incorrect information can have critical consequences, it is essential to evaluate how reliably LLMs retain factual medical knowledge. To address this challenge, we introduce the Medical Knowledge Judgment Dataset (MKJ), a dataset derived from the Unified Medical Language System (UMLS), a comprehensive repository of standardized biomedical vocabularies and knowledge graphs. Through a binary classification framework, MKJ evaluates LLMs' grasp of fundamental medical facts by having them assess the validity of concise, one-hop statements, enabling direct measurement of their knowledge retention. Our experiments reveal that LLMs have difficulty accurately recalling medical facts, with performance varying substantially across semantic types and showing notable weakness on uncommon medical conditions. Furthermore, LLMs are poorly calibrated and are often overconfident in incorrect answers. To mitigate these issues, we explore retrieval-augmented generation, demonstrating its effectiveness in improving factual accuracy and reducing uncertainty in medical decision-making.

Authors: Bryan Hooi, Yujun Cai, Yiwei Wang, Nanyun Peng, Jiaxi Li, Kai Zhang, Kai-Wei Chang, Jin Lu

Subject classification: Medical and health theory; medical research methods

Recommended citation: Bryan Hooi, Yujun Cai, Yiwei Wang, Nanyun Peng, Jiaxi Li, Kai Zhang, Kai-Wei Chang, Jin Lu. Fact or Guesswork? Evaluating Large Language Models' Medical Knowledge with Structured One-Hop Judgments [EB/OL]. (2025-08-19) [2025-09-05]. https://arxiv.org/abs/2502.14275.
