Benchmarking performance, explainability, and evaluation strategies of vision-language models for surgery: Challenges and opportunities

Source: arXiv
Abstract

Minimally invasive surgery (MIS) presents significant visual challenges, including a limited field of view, specular reflections, and inconsistent lighting conditions due to the small incision and the use of endoscopes. Over the past decade, many machine learning and deep learning models have been developed to identify and detect instruments and anatomical structures in surgical videos. However, these models are typically trained on manually labeled, procedure- and task-specific datasets that are relatively small, resulting in limited generalization to unseen data. In practice, hospitals generate a massive amount of raw surgical data every day, including videos captured during various procedures. Labeling this data is almost impractical, as it requires highly specialized expertise. The recent success of vision-language models (VLMs), which can be trained on large volumes of raw image-text pairs and exhibit strong adaptability, offers a promising alternative for leveraging unlabeled surgical data. While some existing work has explored applying VLMs to surgical tasks, their performance remains limited. To support future research in developing more effective VLMs for surgical applications, this paper aims to answer a key question: How well do existing VLMs, both general-purpose and surgery-specific, perform on surgical data, and what types of scenes do they struggle with? To address this, we conduct a benchmarking study of several popular VLMs across diverse laparoscopic datasets. Specifically, we visualize each model's attention to identify which regions of the image it focuses on when making predictions for surgical tasks. We also propose a metric to evaluate whether the model attends to task-relevant regions. Our findings reveal a mismatch between prediction accuracy and visual grounding, indicating that models may make correct predictions while focusing on irrelevant areas of the image.
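The abstract refers to a metric for whether a model attends to task-relevant regions but does not define it here. The sketch below shows one plausible formulation, assuming a 2-D attention/saliency map and a binary task-relevance mask (e.g., an instrument or anatomy annotation); the function name and the attention-mass-overlap formulation are illustrative assumptions, not the paper's actual metric.

```python
# Illustrative sketch only: the paper's metric is not specified in this abstract.
# Assumption: grounding is scored as the fraction of attention mass that falls
# inside the annotated task-relevant region.
import numpy as np

def attention_grounding_score(attention_map: np.ndarray,
                              relevance_mask: np.ndarray) -> float:
    """Fraction of total attention mass on task-relevant pixels.

    attention_map  : (H, W) non-negative saliency/attention values.
    relevance_mask : (H, W) binary mask, 1 where the pixel is task-relevant.
    Returns a value in [0, 1]; 1 means all attention lies inside the mask.
    """
    attention_map = np.clip(attention_map, 0.0, None)
    total = attention_map.sum()
    if total == 0:
        return 0.0
    return float((attention_map * relevance_mask).sum() / total)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    attn = rng.random((224, 224))          # stand-in for a model's attention map
    mask = np.zeros((224, 224))
    mask[60:160, 80:180] = 1               # hypothetical instrument bounding box
    print(f"grounding score: {attention_grounding_score(attn, mask):.3f}")
```

A score of this kind can be compared against prediction accuracy per image to expose the accuracy-versus-grounding mismatch the abstract describes.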

Jiajun Cheng, Xianwu Zhao, Shan Lin

Subjects: medical research methodology; current state and development of medicine

Jiajun Cheng, Xianwu Zhao, Shan Lin. Benchmarking performance, explainability, and evaluation strategies of vision-language models for surgery: Challenges and opportunities [EB/OL]. (2025-05-15) [2025-06-04]. https://arxiv.org/abs/2505.10764.
