
What Makes Multimodal In-Context Learning Work?

Source: arXiv
Abstract

Large Language Models have demonstrated remarkable performance across various tasks, exhibiting the capacity to swiftly acquire new skills, such as through In-Context Learning (ICL) from only a few demonstration examples. In this work, we present a comprehensive framework for investigating Multimodal ICL (M-ICL) in the context of Large Multimodal Models. We consider the best open-source multimodal models (e.g., IDEFICS, OpenFlamingo) and a wide range of multimodal tasks. Our study unveils several noteworthy findings: (1) M-ICL primarily relies on text-driven mechanisms, showing little to no influence from the image modality. (2) When used with an advanced ICL strategy (such as RICES), M-ICL does not outperform a simple strategy based on majority voting over context examples. Moreover, we identify several biases and limitations of M-ICL that warrant consideration prior to deployment. Code available at https://gitlab.com/folbaeni/multimodal-icl
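
Finding (2) contrasts RICES (Retrieval-based In-Context Example Selection) with a majority-vote baseline over the retrieved context examples. The following is a minimal sketch of both ideas, not the authors' implementation: it assumes precomputed image embeddings (e.g., from CLIP), and the function names and data layout are hypothetical.

```python
import numpy as np
from collections import Counter

def rices_select(query_emb, pool_embs, k=8):
    """RICES-style selection: return indices of the k pool examples
    whose embeddings are most cosine-similar to the query."""
    sims = pool_embs @ query_emb / (
        np.linalg.norm(pool_embs, axis=1) * np.linalg.norm(query_emb) + 1e-8
    )
    return np.argsort(-sims)[:k]

def majority_vote(labels, selected_idx):
    """Baseline from the abstract: predict the most frequent label
    among the selected in-context examples, ignoring the model."""
    return Counter(labels[i] for i in selected_idx).most_common(1)[0][0]

# Hypothetical usage with random stand-ins for image embeddings.
pool_embs = np.random.randn(100, 512)
labels = ["cat" if i % 2 else "dog" for i in range(100)]
query_emb = np.random.randn(512)
context = rices_select(query_emb, pool_embs, k=8)
print(majority_vote(labels, context))
```

If this label-frequency baseline matches M-ICL with RICES, as the abstract reports, it suggests the model may be exploiting the label statistics of the retrieved context rather than performing genuine multimodal reasoning.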

Matthieu Cord, Mustafa Shukor, Folco Bertini Baldassini, Laure Soulier, Benjamin Piwowarski

Subjects: Computing Technology, Computer Technology

Matthieu Cord, Mustafa Shukor, Folco Bertini Baldassini, Laure Soulier, Benjamin Piwowarski. What Makes Multimodal In-Context Learning Work? [EB/OL]. (2024-04-24) [2025-07-16]. https://arxiv.org/abs/2404.15736.
