Multi-modal Pre-training for Medical Vision-language Understanding and Generation: An Empirical Study with A New Benchmark
With the availability of large-scale, comprehensive, general-purpose vision-language (VL) datasets such as MSCOCO, vision-language pre-training (VLP) has become an active research area and has proven effective for various VL tasks such as visual question answering. However, studies of VLP in the medical domain have so far been scarce. To provide a comprehensive perspective on VLP for medical VL tasks, we conduct a thorough experimental analysis of the key factors that may affect the performance of VLP with a unified vision-language Transformer. To support sound and efficient pre-training decisions, we propose RadioGraphy Captions (RGC), a high-quality, multi-modality radiographic dataset containing 18,434 image-caption pairs collected from MedPix, an open-access online database. RGC can be used as a pre-training dataset or as a new benchmark for medical report generation and medical image-text retrieval. By leveraging RGC and other available datasets for pre-training, we develop several key insights that can guide future medical VLP research, and we establish strong new baselines for various medical VL tasks.
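Since RGC is positioned as a benchmark for medical image-text retrieval, such benchmarks are typically scored with recall@K over an image-text similarity matrix. The sketch below is a minimal, hypothetical illustration of that protocol only; the pair ordering, similarity source, and K values are assumptions for illustration and are not taken from the paper.

```python
# Minimal sketch of image-to-text retrieval evaluation (recall@K), assuming a
# precomputed similarity matrix in which row i and column i form the
# ground-truth image-caption pair. The data layout is an assumption, not the
# paper's protocol.
import numpy as np

def recall_at_k(similarity: np.ndarray, ks=(1, 5, 10)) -> dict:
    """similarity[i, j] = score between image i and caption j;
    the correct caption for image i is assumed to sit at index i."""
    n = similarity.shape[0]
    # Rank captions for each image from most to least similar.
    ranking = np.argsort(-similarity, axis=1)
    # Position of the ground-truth caption in each image's ranking.
    gt_rank = np.array([np.where(ranking[i] == i)[0][0] for i in range(n)])
    return {f"R@{k}": float(np.mean(gt_rank < k)) for k in ks}

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    sim = rng.normal(size=(100, 100))           # stand-in similarity scores
    sim[np.arange(100), np.arange(100)] += 2.0  # make true pairs score higher
    print(recall_at_k(sim))
```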
Bo Liu, Ameer Hamza Khan, Lu Fan, Li Xu, Xiao-Ming Wu
Subjects: computational techniques for medical research methods; computer technology
Bo Liu, Ameer Hamza Khan, Lu Fan, Li Xu, Xiao-Ming Wu. Multi-modal Pre-training for Medical Vision-language Understanding and Generation: An Empirical Study with A New Benchmark [EB/OL]. (2023-06-10) [2025-05-22]. https://arxiv.org/abs/2306.06494