EPIC: Efficient Prompt Interaction for Text-Image Classification

Abstract

In recent years, large-scale pre-trained multimodal models (LMMs) have emerged that integrate the vision and language modalities, achieving considerable success in multimodal tasks such as text-image classification. The growing size of LMMs, however, makes fine-tuning them for downstream tasks computationally expensive. Prompt-based interaction strategies have therefore been studied as a more efficient way to align modalities. In this context, we propose a novel efficient prompt-based multimodal interaction strategy, namely Efficient Prompt Interaction for text-image Classification (EPIC). Specifically, we utilize temporal prompts on intermediate layers and integrate different modalities with similarity-based prompt interaction, enabling sufficient information exchange between modalities. With this approach, our method reduces computational resource consumption and requires fewer trainable parameters (about 1% of the foundation model) than other fine-tuning strategies. Furthermore, it demonstrates superior performance on the UPMC-Food101 and SNLI-VE datasets, while achieving comparable performance on the MM-IMDB dataset.
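
The abstract does not spell out how the similarity-based prompt interaction is computed, so the following is only a minimal sketch of one plausible reading, not the authors' implementation. It assumes that the prompts of both modalities share the same dimension, that similarity is cosine similarity, and that each prompt is blended with a softmax-weighted average of the other modality's prompts. The names `interact` and `mix` are hypothetical and chosen for illustration.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def interact(own_prompts, other_prompts, mix=0.5):
    """Hypothetical interaction step (an assumption, not taken from the paper).

    Each prompt of one modality is blended with a similarity-weighted
    average of the other modality's prompts. `mix` controls how much of
    the cross-modal context is injected (0.0 leaves prompts unchanged).
    """
    updated = []
    for p in own_prompts:
        # Weight the other modality's prompts by their similarity to p.
        weights = softmax([cosine_similarity(p, q) for q in other_prompts])
        context = [
            sum(w * q[i] for w, q in zip(weights, other_prompts))
            for i in range(len(p))
        ]
        updated.append(
            [(1.0 - mix) * p[i] + mix * context[i] for i in range(len(p))]
        )
    return updated


# Toy prompts: two 2-dimensional vectors per modality.
text_prompts = [[1.0, 0.0], [0.0, 1.0]]
image_prompts = [[1.0, 0.0], [0.0, 1.0]]
new_text_prompts = interact(text_prompts, image_prompts)
```

In this sketch only the prompt vectors would be trainable while the backbone stays frozen, which is the general idea behind the small trainable-parameter count the abstract reports; the actual layer placement and update rule are described in the paper itself.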

Authors: Xinyao Yu, Hao Sun, Zeyu Ling, Ziwei Niu, Zhenjia Bai, Rui Qin, Yen-Wei Chen, Lanfen Lin

Subject classification: Computing technology, computer technology

Recommended citation: Xinyao Yu, Hao Sun, Zeyu Ling, Ziwei Niu, Zhenjia Bai, Rui Qin, Yen-Wei Chen, Lanfen Lin. EPIC: Efficient Prompt Interaction for Text-Image Classification [EB/OL]. (2025-07-10) [2025-07-25]. https://arxiv.org/abs/2507.07415.
