
CoMP: Continual Multimodal Pre-training for Vision Foundation Models


Source: arXiv

English Abstract

Pre-trained Vision Foundation Models (VFMs) provide strong visual representations for a wide range of applications. In this paper, we continually pre-train prevailing VFMs in a multimodal manner such that they can effortlessly process visual inputs of varying sizes and produce visual representations that are more aligned with language representations, regardless of their original pre-training process. To this end, we introduce CoMP, a carefully designed multimodal pre-training pipeline. CoMP uses a Continual Rotary Position Embedding to support native resolution continual pre-training, and an Alignment Loss between visual and textual features through language prototypes to align multimodal representations. By three-stage training, our VFMs achieve remarkable improvements not only in multimodal understanding but also in other downstream tasks such as classification and segmentation. Remarkably, CoMP-SigLIP achieves scores of 66.7 on ChartQA and 75.9 on DocVQA with a 0.5B LLM, while maintaining an 87.4% accuracy on ImageNet-1K and a 49.5 mIoU on ADE20K under frozen chunk evaluation.
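The abstract's exact formulation of the alignment loss through language prototypes is not spelled out here, but the idea can be illustrated with a minimal sketch: visual and textual features are softly assigned to a shared set of language prototypes, and the visual-side assignment is pulled toward the text-side one. The prototype matrix, temperature, and KL-based matching below are assumptions for illustration, not the paper's definitive implementation.

```python
# Hypothetical sketch of an alignment loss through language prototypes.
# Prototype source (e.g., LLM embedding rows), temperature, and the KL
# matching direction are assumptions; see the paper for the actual loss.
import torch
import torch.nn.functional as F

def prototype_alignment_loss(visual_feats, text_feats, prototypes, tau=0.07):
    """Align visual and textual features by matching their soft assignments
    over a shared set of language prototypes.

    visual_feats: (N, D) pooled visual features
    text_feats:   (N, D) corresponding textual features
    prototypes:   (K, D) language prototype vectors
    """
    v = F.normalize(visual_feats, dim=-1)
    t = F.normalize(text_feats, dim=-1)
    p = F.normalize(prototypes, dim=-1)

    # Soft assignment of each feature to the language prototypes.
    v_logits = v @ p.t() / tau  # (N, K)
    t_logits = t @ p.t() / tau  # (N, K)

    # Use the text-side distribution as a (detached) soft target for the visual side.
    target = F.softmax(t_logits, dim=-1).detach()
    return F.kl_div(F.log_softmax(v_logits, dim=-1), target, reduction="batchmean")

# Toy usage with random tensors.
if __name__ == "__main__":
    N, D, K = 8, 256, 1000
    loss = prototype_alignment_loss(torch.randn(N, D), torch.randn(N, D), torch.randn(K, D))
    print(float(loss))
```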

Yitong Chen, Lingchen Meng, Wujian Peng, Zuxuan Wu, Yu-Gang Jiang

Subject: Computing Technology; Computer Technology

Yitong Chen, Lingchen Meng, Wujian Peng, Zuxuan Wu, Yu-Gang Jiang. CoMP: Continual Multimodal Pre-training for Vision Foundation Models [EB/OL]. (2025-03-24) [2025-05-05]. https://arxiv.org/abs/2503.18931
