V2T-CoT: From Vision to Text Chain-of-Thought for Medical Reasoning and Diagnosis

Source: arXiv

Abstract

Recent advances in multimodal techniques have led to significant progress in Medical Visual Question Answering (Med-VQA). However, most existing models focus on global image features rather than localizing the disease-specific regions crucial for diagnosis. Additionally, current research tends to emphasize answer accuracy at the expense of the reasoning pathway, yet both are essential for clinical decision-making. To address these challenges, we propose From Vision to Text Chain-of-Thought (V2T-CoT), a novel approach that automates the localization of preference areas within biomedical images and incorporates this localization into region-level pixel attention as knowledge for the Vision CoT. By fine-tuning a vision-language model on the constructed R-Med 39K dataset, V2T-CoT produces explicit medical reasoning paths. V2T-CoT integrates visual grounding with textual rationale generation to yield precise and explainable diagnostic results. Experimental results across four Med-VQA benchmarks demonstrate state-of-the-art performance, with substantial gains in both accuracy and interpretability.
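
The abstract gives only a high-level description of the region-conditioning step, so the following is a minimal PyTorch sketch of what "region-level pixel attention" could look like: a soft mask produced by a region localizer additively biases the attention logits, so pixels inside the localized disease region dominate the pooled visual feature handed to the rationale-generating language model. The function name, the region_bias knob, and the single global query are illustrative assumptions, not the authors' implementation.

import torch

def region_level_pixel_attention(feat, region_mask, region_bias=2.0):
    # feat:        (B, C, H, W) feature map from the vision encoder
    # region_mask: (B, 1, H, W) soft mask in [0, 1] from a region localizer
    # region_bias: additive logit bonus for in-region pixels (hypothetical knob)
    B, C, H, W = feat.shape
    pixels = feat.flatten(2).transpose(1, 2)                 # (B, H*W, C)
    query = pixels.mean(dim=1, keepdim=True)                 # (B, 1, C) global query
    logits = query @ pixels.transpose(1, 2) / C ** 0.5       # (B, 1, H*W)
    logits = logits + region_bias * region_mask.flatten(2)   # up-weight in-region pixels
    attn = logits.softmax(dim=-1)                            # (B, 1, H*W)
    return (attn @ pixels).squeeze(1)                        # (B, C) region-aware feature

# Example: a 14x14 feature map with a mask covering the upper-left quadrant.
feat = torch.randn(1, 256, 14, 14)
mask = torch.zeros(1, 1, 14, 14)
mask[..., :7, :7] = 1.0
pooled = region_level_pixel_attention(feat, mask)
print(pooled.shape)  # torch.Size([1, 256])

Using an additive bias rather than a hard crop keeps global context available while steering attention toward the localized region, which fits the abstract's framing of the localization as knowledge injected into attention rather than a replacement for the full image.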

Jiaxiang Liu, Shujian Gao, Bin Feng, Zhihang Tang, Xiaotang Gai, Jian Wu, Zuozhu Liu, Yuan Wang

Subjects: Medical research methods; Current state and development of medicine; Clinical medicine

Jiaxiang Liu, Shujian Gao, Bin Feng, Zhihang Tang, Xiaotang Gai, Jian Wu, Zuozhu Liu, Yuan Wang. V2T-CoT: From Vision to Text Chain-of-Thought for Medical Reasoning and Diagnosis [EB/OL]. (2025-06-27) [2025-07-25]. https://arxiv.org/abs/2506.19610.