国家预印本平台

Video CLIP Model for Multi-View Echocardiography Interpretation

Source: arXiv

Abstract

Echocardiography records videos of the heart using ultrasound, enabling clinicians to evaluate its condition. Recent advances in large-scale vision-language models (VLMs) have drawn attention to automating the interpretation of echocardiographic videos. However, most existing VLMs proposed for medical interpretation rely on single-frame (i.e., image) inputs, and these image-based models often exhibit lower diagnostic accuracy for conditions identifiable only through cardiac motion. Moreover, echocardiographic videos are recorded from various views, depending on the direction of ultrasound emission, and certain views are better suited than others for interpreting specific conditions; incorporating multiple views could therefore further improve accuracy. In this study, we developed a video-language model that takes five different views and full video sequences as input, training it on pairs of echocardiographic videos and clinical reports from 60,747 cases. Our experiments demonstrate that this expanded approach achieves higher interpretation accuracy than models trained on single-view videos or still images.
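The title indicates a CLIP-style objective, i.e., contrastive alignment of video and report embeddings. The paper's exact fusion of the five views is not given in the abstract; the sketch below is a minimal, hypothetical illustration that mean-pools per-view video embeddings into one study embedding and applies the standard symmetric InfoNCE (CLIP) loss over a batch of video-report pairs. All dimensions, the pooling choice, and the temperature are assumptions, and random vectors stand in for real encoder outputs.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    # Project embeddings onto the unit sphere, as in CLIP.
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def clip_contrastive_loss(video_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss: row i of video_emb is paired with row i of text_emb."""
    v = l2_normalize(video_emb)
    t = l2_normalize(text_emb)
    logits = v @ t.T / temperature              # (B, B) cosine-similarity matrix
    labels = np.arange(len(logits))             # matching pairs lie on the diagonal

    def xent(lg):
        # Numerically stable cross-entropy against the diagonal targets.
        lg = lg - lg.max(axis=1, keepdims=True)
        logp = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()

    # Average the video->report and report->video directions.
    return (xent(logits) + xent(logits.T)) / 2

rng = np.random.default_rng(0)
B, V, D = 4, 5, 32                              # batch size, 5 echo views, embedding dim
view_embs = rng.normal(size=(B, V, D))          # hypothetical per-view video encoder outputs
video_emb = view_embs.mean(axis=1)              # fuse the five views (assumed mean pooling)
text_emb = rng.normal(size=(B, D))              # hypothetical report encoder outputs
loss = clip_contrastive_loss(video_emb, text_emb)
print(float(loss))
```

With random embeddings the loss sits near log(B), the value for an uninformative similarity matrix; training the two encoders to minimize it pulls each study's five-view video embedding toward its own clinical report and away from the other reports in the batch.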

Norihiko Takeda, Ryo Takizawa, Satoshi Kodera, Tempei Kabayama, Ryo Matsuoka, Yuta Ando, Yuto Nakamura, Haruki Settai

Subjects: Basic Medicine; Clinical Medicine; Medical Research Methods

Norihiko Takeda, Ryo Takizawa, Satoshi Kodera, Tempei Kabayama, Ryo Matsuoka, Yuta Ando, Yuto Nakamura, Haruki Settai. Video CLIP Model for Multi-View Echocardiography Interpretation [EB/OL]. (2025-04-26) [2025-06-06]. https://arxiv.org/abs/2504.18800.
