
Image-to-Text for Medical Reports Using Adaptive Co-Attention and Triple-LSTM Module

Source: arXiv
Abstract

Medical report generation requires specialized expertise that general large models often fail to capture accurately. Moreover, the inherent repetition and similarity in medical data make it difficult for models to extract meaningful features, leading to a tendency to overfit. In this paper, we therefore propose the Co-Attention Triple-LSTM Network (CA-TriNet), a multimodal deep learning model that combines transformer architectures with a multi-LSTM network. Its Co-Attention module synergistically links a vision transformer with a text transformer to better differentiate medical images with high similarity, augmented by an adaptive weight operator that captures and amplifies image labels with only minor differences. Furthermore, its Triple-LSTM module refines the generated sentences using targeted image objects. Extensive evaluations on three public datasets demonstrate that CA-TriNet outperforms state-of-the-art models in overall capability and even surpasses pre-trained large language models on some metrics.
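To illustrate the co-attention idea sketched in the abstract, the following is a minimal, hypothetical PyTorch sketch: image patch features from a vision transformer and token features from a text transformer attend to each other, and an adaptive weight operator (assumed here to be a learned sigmoid gate) re-scales the cross-modal context so that subtle differences between similar images are amplified. Module names, dimensions, and the gating form are illustrative assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class CoAttentionBlock(nn.Module):
    """Cross-modal co-attention with an adaptive gating operator (illustrative)."""
    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        # Cross-attention in both directions: image attends to text, text attends to image.
        self.img_attends_txt = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.txt_attends_img = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Adaptive weight operator (assumed form): per-channel sigmoid gate that
        # amplifies weakly expressed, discriminative label features.
        self.gate = nn.Sequential(nn.Linear(dim, dim), nn.Sigmoid())

    def forward(self, img_feats: torch.Tensor, txt_feats: torch.Tensor):
        # img_feats: (B, N_img, dim) patch embeddings from a vision transformer
        # txt_feats: (B, N_txt, dim) token embeddings from a text transformer
        img_ctx, _ = self.img_attends_txt(img_feats, txt_feats, txt_feats)
        txt_ctx, _ = self.txt_attends_img(txt_feats, img_feats, img_feats)
        # Gate the image-side context so small cross-modal differences are scaled up.
        img_out = img_feats + self.gate(img_ctx) * img_ctx
        txt_out = txt_feats + txt_ctx
        return img_out, txt_out

if __name__ == "__main__":
    block = CoAttentionBlock()
    img = torch.randn(2, 49, 512)   # e.g. a 7x7 ViT patch grid
    txt = torch.randn(2, 32, 512)   # e.g. 32 report tokens
    img_out, txt_out = block(img, txt)
    print(img_out.shape, txt_out.shape)  # (2, 49, 512) (2, 32, 512)
```

The gated residual connection is one plausible reading of "adaptive weight operator"; the refined image features would then feed the Triple-LSTM decoder described in the abstract.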

Yishen Liu, Shengda Liu, Hudan Pan

Subject: Medical research methodology

Yishen Liu, Shengda Liu, Hudan Pan. Image-to-Text for Medical Reports Using Adaptive Co-Attention and Triple-LSTM Module [EB/OL]. (2025-03-23) [2025-05-07]. https://arxiv.org/abs/2503.18297.
