Source: arXiv

MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action


Zicheng Liu, Linjie Li, Kevin Lin, Zhengyuan Yang, Michael Zeng, Lijuan Wang, Faisal Ahmed, Jianfeng Wang, Ce Liu, Ehsan Azarnasab

Subject: Computing Technology, Computer Technology

Zicheng Liu, Linjie Li, Kevin Lin, Zhengyuan Yang, Michael Zeng, Lijuan Wang, Faisal Ahmed, Jianfeng Wang, Ce Liu, Ehsan Azarnasab. MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action [EB/OL]. (2023-03-20) [2025-10-01]. https://arxiv.org/abs/2303.11381

We propose MM-REACT, a system paradigm that integrates ChatGPT with a pool of vision experts to achieve multimodal reasoning and action. In this paper, we define and explore a comprehensive list of advanced vision tasks that are intriguing to solve, but may exceed the capabilities of existing vision and vision-language models. To achieve such advanced visual intelligence, MM-REACT introduces a textual prompt design that can represent text descriptions, textualized spatial coordinates, and aligned file names for dense visual signals such as images and videos. MM-REACT's prompt design allows language models to accept, associate, and process multimodal information, thereby facilitating the synergetic combination of ChatGPT and various vision experts. Zero-shot experiments demonstrate MM-REACT's effectiveness in addressing the specified capabilities of interest and its wide application in different scenarios that require advanced visual understanding. Furthermore, we discuss and compare MM-REACT's system paradigm with an alternative approach that extends language models for multimodal scenarios through joint finetuning. Code, demo, video, and visualization are available at https://multimodal-react.github.io/
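
The abstract describes a loop in which the language model reasons in text, invokes vision experts by name, and receives their textualized outputs back in the prompt. Below is a minimal Python sketch of that reasoning-and-action pattern under stated assumptions: `call_chatgpt`, `run_vision_expert`, and the `<ACTION>`/`<OBSERVATION>` tags are hypothetical placeholders for illustration, not the authors' released code or prompt format.

```python
# Minimal sketch of an MM-REACT-style reasoning-and-action loop.
# Assumptions (not from the paper's code): call_chatgpt wraps any chat-completion
# API, and run_vision_expert wraps a local vision tool; both are placeholders.

import re

def call_chatgpt(prompt: str) -> str:
    """Placeholder for a chat-completion call; swap in a real LLM client."""
    raise NotImplementedError

def run_vision_expert(name: str, image_path: str) -> str:
    """Placeholder for a vision expert (e.g., captioning, OCR, detection)
    that returns its result as plain text the language model can consume."""
    raise NotImplementedError

# The image is referenced only by an aligned file name inside the text prompt,
# mirroring the textual prompt design for dense visual signals described above.
ACTION_PATTERN = re.compile(r"<ACTION>\s*(\w+)\((.+?)\)")

def mm_react(user_query: str, image_path: str, max_turns: int = 5) -> str:
    prompt = (
        "You may call vision experts by emitting <ACTION> expert_name(file_name).\n"
        f"Image file: {image_path}\nUser: {user_query}\n"
    )
    reply = ""
    for _ in range(max_turns):
        reply = call_chatgpt(prompt)
        match = ACTION_PATTERN.search(reply)
        if match is None:
            return reply  # no further action requested: treat as the final answer
        expert, arg = match.group(1), match.group(2).strip()
        observation = run_vision_expert(expert, arg)
        # Append the textualized expert output (descriptions, coordinates, etc.)
        # so the next turn can reason over it.
        prompt += f"{reply}\n<OBSERVATION> {observation}\n"
    return reply
```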
