
Adaptation of Multi-modal Representation Models for Multi-task Surgical Computer Vision

Source: arXiv
Abstract

Surgical AI often involves multiple tasks within a single procedure, like phase recognition or assessing the Critical View of Safety in laparoscopic cholecystectomy. Traditional models, built for one task at a time, lack flexibility, requiring a separate model for each. To address this, we introduce MML-SurgAdapt, a unified multi-task framework with Vision-Language Models (VLMs), specifically CLIP, to handle diverse surgical tasks through natural language supervision. A key challenge in multi-task learning is the presence of partial annotations when integrating different tasks. To overcome this, we employ Single Positive Multi-Label (SPML) learning, which traditionally reduces annotation burden by training models with only one positive label per instance. Our framework extends this approach to integrate data from multiple surgical tasks within a single procedure, enabling effective learning despite incomplete or noisy annotations. We demonstrate the effectiveness of our model on a combined dataset consisting of Cholec80, Endoscapes2023, and CholecT50, utilizing custom prompts. Extensive evaluation shows that MML-SurgAdapt performs comparably to task-specific benchmarks, with the added advantage of handling noisy annotations. It also outperforms existing SPML frameworks on this task. By reducing the required labels by 23%, our approach offers a more scalable and efficient labeling process, significantly easing the annotation burden on clinicians. To our knowledge, this is the first application of SPML to integrate data from multiple surgical tasks, presenting a novel and generalizable solution for multi-task learning in surgical computer vision. Implementation is available at: https://github.com/CAMMA-public/MML-SurgAdapt
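The abstract describes prompt-based multi-label classification with CLIP under SPML supervision. The snippet below is a minimal sketch of that general recipe, not the authors' MML-SurgAdapt implementation (see the linked repository for that); the prompt texts, the "assume negative" loss, and the function names are illustrative assumptions.

import torch
import torch.nn.functional as F
import clip  # OpenAI CLIP package (https://github.com/openai/CLIP)

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/16", device=device)

# Hypothetical prompts pooling labels from several surgical tasks
# (phase recognition, CVS criteria, tool presence).
prompts = [
    "a laparoscopic frame from the calot triangle dissection phase",
    "a frame where the cystic duct is clearly exposed",
    "a frame showing a grasper instrument",
]
text_tokens = clip.tokenize(prompts).to(device)

def spml_assumed_negative_loss(logits, positive_idx):
    # Binary cross-entropy where the single annotated label per image is
    # treated as positive and every unannotated label is assumed negative
    # (the standard SPML "assume negative" baseline).
    targets = torch.zeros_like(logits)
    targets[torch.arange(logits.size(0)), positive_idx] = 1.0
    return F.binary_cross_entropy_with_logits(logits, targets)

def training_step(images, positive_idx):
    # images: preprocessed frames [B, 3, 224, 224]; positive_idx: [B] label ids
    image_feats = F.normalize(model.encode_image(images), dim=-1)
    text_feats = F.normalize(model.encode_text(text_tokens), dim=-1)
    logits = model.logit_scale.exp() * image_feats @ text_feats.t()  # [B, num_labels]
    return spml_assumed_negative_loss(logits, positive_idx)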

Soham Walimbe, Britty Baby, Vinkle Srivastav, Nicolas Padoy

Subject areas: medical research methods, computational techniques; computer technology

Soham Walimbe, Britty Baby, Vinkle Srivastav, Nicolas Padoy. Adaptation of Multi-modal Representation Models for Multi-task Surgical Computer Vision [EB/OL]. (2025-07-10) [2025-07-16]. https://arxiv.org/abs/2507.05020.
