VideoPath-LLaVA: Pathology Diagnostic Reasoning Through Video Instruction Tuning
We present VideoPath-LLaVA, the first large multimodal model (LMM) in computational pathology that integrates three distinct imaging scenarios: single patch images, automatically keyframe-extracted clips, and manually segmented pathology videos, mimicking the natural diagnostic process of pathologists. By generating detailed histological descriptions that culminate in a definitive sign-out diagnosis, VideoPath-LLaVA bridges visual narratives with diagnostic reasoning. Central to our approach is the VideoPath-Instruct dataset, comprising 4278 video and diagnosis-specific chain-of-thought instructional pairs sourced from educational histopathology videos on YouTube. Although high-quality data is critical for enhancing diagnostic reasoning, its creation is time-intensive and limited in volume. To overcome this challenge, we transfer knowledge from existing single-image instruction datasets to train on weakly annotated, keyframe-extracted clips, followed by fine-tuning on manually segmented videos. VideoPath-LLaVA establishes a new benchmark in pathology video analysis and offers a promising foundation for future AI systems that support clinical decision-making through integrated visual and diagnostic reasoning. Our code, data, and model are publicly available at https://github.com/trinhvg/VideoPath-LLaVA.
Jin Tae Kwak, Trinh T. L. Vuong
Medical Research Methods · Clinical Medicine
Jin Tae Kwak, Trinh T. L. Vuong. VideoPath-LLaVA: Pathology Diagnostic Reasoning Through Video Instruction Tuning [EB/OL]. (2025-05-07) [2025-05-21]. https://arxiv.org/abs/2505.04192.