Unveiling the Potential of Vision-Language-Action Models with Open-Ended Multimodal Instructions
Vision-Language-Action (VLA) models have recently become highly prominent in the field of robotics. Leveraging vision-language foundation models trained on large-scale internet data, VLA models can generate robotic actions directly from visual observations and human instructions through a single end-to-end neural network. Despite their effectiveness, current VLA models usually accept only one form of human prompting, namely language instructions, which may constrain their applicability in open-ended human-robot interactions. For example, a user might expect the robot to retrieve an object shown in an image, follow an instruction written on a whiteboard, or imitate a behavior demonstrated in a video, rather than relying solely on language-based descriptions. To address this gap, we introduce OE-VLA, which explores the potential of VLA models for open-ended multimodal instructions. Extensive results demonstrate that OE-VLA not only achieves performance comparable to traditional VLA models on linguistic input but also delivers strong results across four additional categories of open-ended tasks. The proposed methodology could significantly expand the applications of VLA models across everyday scenarios and facilitate human-robot interaction.
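To make the idea of open-ended multimodal instructions concrete, the sketch below shows what such an interface to a VLA-style policy might look like. This is purely illustrative and not the authors' released code: the names `MultimodalInstruction`, `OpenEndedVLAPolicy`, and the 7-dimensional action assumption are invented for exposition, and the encoder and action head are placeholders rather than a real network.

```python
from dataclasses import dataclass
from typing import Optional, Sequence

import numpy as np


@dataclass
class MultimodalInstruction:
    """Hypothetical container for an open-ended prompt: language, a reference
    image, or a demonstration video, mirroring the instruction types named in
    the abstract."""
    text: Optional[str] = None                                   # natural-language command
    reference_image: Optional[np.ndarray] = None                 # e.g. a photo of the target object
    demonstration_video: Optional[Sequence[np.ndarray]] = None   # frames of a behavior to imitate


class OpenEndedVLAPolicy:
    """Sketch of a VLA-style policy mapping (observation, instruction) -> action.
    The internals are placeholders, not the OE-VLA architecture."""

    def __init__(self, action_dim: int = 7):
        self.action_dim = action_dim

    def act(self, observation: np.ndarray, instruction: MultimodalInstruction) -> np.ndarray:
        # A real model would tokenize each modality, fuse the tokens in a
        # vision-language backbone, and decode an action; here we only show
        # the interface and return a dummy action of the right shape.
        assert observation.ndim == 3, "expects an H x W x C image observation"
        return np.zeros(self.action_dim)


if __name__ == "__main__":
    policy = OpenEndedVLAPolicy()
    obs = np.zeros((224, 224, 3))
    # The robot is asked to retrieve the object shown in a reference photo.
    instr = MultimodalInstruction(
        text="pick up the object shown in the photo",
        reference_image=np.ones((224, 224, 3)),
    )
    print(policy.act(obs, instr))   # placeholder 7-DoF action vector
```

Keeping every modality optional in a single container is one way to let the same policy entry point serve plain language commands as well as image- or video-conditioned requests.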
Wei Zhao, Gongsheng Li, Zhefei Gong, Pengxiang Ding, Han Zhao, Donglin Wang
Computing Technology, Computer Technology
Wei Zhao, Gongsheng Li, Zhefei Gong, Pengxiang Ding, Han Zhao, Donglin Wang. Unveiling the Potential of Vision-Language-Action Models with Open-Ended Multimodal Instructions [EB/OL]. (2025-05-16) [2025-06-15]. https://arxiv.org/abs/2505.11214