
A Survey on Efficient Vision-Language Models


Source: arXiv

Abstract

Vision-language models (VLMs) integrate visual and textual information, enabling a wide range of applications such as image captioning and visual question answering, making them crucial for modern AI systems. However, their high computational demands pose challenges for real-time applications. This has led to a growing focus on developing efficient vision language models. In this survey, we review key techniques for optimizing VLMs on edge and resource-constrained devices. We also explore compact VLM architectures, frameworks and provide detailed insights into the performance-memory trade-offs of efficient VLMs. Furthermore, we establish a GitHub repository at https://github.com/MPSCUMBC/Efficient-Vision-Language-Models-A-Survey to compile all surveyed papers, which we will actively update. Our objective is to foster deeper research in this area.

Gaurav Shinde, Anuradha Ravi, Emon Dey, Shadman Sakib, Milind Rampure, Nirmalya Roy

Subject classification: Computing Technology, Computer Technology

Gaurav Shinde, Anuradha Ravi, Emon Dey, Shadman Sakib, Milind Rampure, Nirmalya Roy. A Survey on Efficient Vision-Language Models [EB/OL]. (2025-04-13) [2025-05-05]. https://arxiv.org/abs/2504.09724.
