VLMPC: Vision-Language Model Predictive Control for Robotic Manipulation

Source: arXiv
Abstract

Although Model Predictive Control (MPC) can effectively predict the future states of a system and is therefore widely used in robotic manipulation tasks, it lacks environmental perception, leading to failures in some complex scenarios. To address this issue, we introduce Vision-Language Model Predictive Control (VLMPC), a robotic manipulation framework that takes advantage of the powerful perception capability of a vision-language model (VLM) and integrates it with MPC. Specifically, we propose a conditional action sampling module that takes a goal image or a language instruction as input and leverages the VLM to sample a set of candidate action sequences. Then, a lightweight action-conditioned video prediction model is designed to generate a set of future frames conditioned on the candidate action sequences. VLMPC produces the optimal action sequence with the assistance of the VLM through a hierarchical cost function that formulates both pixel-level and knowledge-level consistency between the current observation and the goal image. We demonstrate that VLMPC outperforms state-of-the-art methods on public benchmarks. More importantly, our method shows excellent performance in various real-world robotic manipulation tasks. Code is available at https://github.com/PPjmchen/VLMPC.
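To make the pipeline concrete, the following is a minimal sketch in Python of the receding-horizon loop described above: sample candidate action sequences, predict future frames for each, score them with a pixel-level plus knowledge-level cost, and execute the first action of the best sequence. All function names and the random stand-ins for the VLM sampler, the video prediction model, and the knowledge-level cost are illustrative assumptions for exposition only, not the API of the released code.

import numpy as np

def sample_actions_with_vlm(observation, goal, num_candidates, horizon, action_dim):
    """Placeholder for the VLM-conditioned action sampler: given the current
    observation and a goal image or language instruction, return candidate
    action sequences. Random sequences stand in for the VLM here."""
    return np.random.uniform(-1.0, 1.0, size=(num_candidates, horizon, action_dim))

def predict_frames(observation, action_sequences):
    """Placeholder for the lightweight action-conditioned video prediction
    model: roll out predicted future frames for each candidate sequence.
    Here we simply perturb copies of the current observation."""
    n, horizon, _ = action_sequences.shape
    frames = np.repeat(observation[None, None], horizon, axis=1)
    frames = np.repeat(frames, n, axis=0)
    return frames + 0.01 * np.random.randn(*frames.shape)

def pixel_cost(predicted_frames, goal_image):
    """Pixel-level consistency: mean squared error between the final predicted
    frame of each rollout and the goal image."""
    final_frames = predicted_frames[:, -1]
    return np.mean((final_frames - goal_image) ** 2, axis=(1, 2, 3))

def knowledge_cost(predicted_frames, goal):
    """Placeholder for the knowledge-level term, which the paper scores with
    the assistance of the VLM. Zeros are returned in this sketch."""
    return np.zeros(predicted_frames.shape[0])

def vlmpc_step(observation, goal_image, num_candidates=32, horizon=8,
               action_dim=7, w_pixel=1.0, w_knowledge=1.0):
    """One MPC iteration: sample candidates, predict their futures, score them
    with the hierarchical cost, and return the best action sequence. In a
    receding-horizon loop only its first action is applied before replanning."""
    candidates = sample_actions_with_vlm(observation, goal_image,
                                         num_candidates, horizon, action_dim)
    frames = predict_frames(observation, candidates)
    cost = (w_pixel * pixel_cost(frames, goal_image)
            + w_knowledge * knowledge_cost(frames, goal_image))
    best = np.argmin(cost)
    return candidates[best]

if __name__ == "__main__":
    obs = np.zeros((64, 64, 3))   # current camera observation (dummy)
    goal = np.ones((64, 64, 3))   # goal image (dummy)
    best_sequence = vlmpc_step(obs, goal)
    print("first action to execute:", best_sequence[0])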

Ziyu Meng, Wei Zhang, Jiaming Chen, Donghui Mao, Wentao Zhao, Ran Song

Subjects: Automation technology and equipment; Computing and computer technology

Ziyu Meng, Wei Zhang, Jiaming Chen, Donghui Mao, Wentao Zhao, Ran Song. VLMPC: Vision-Language Model Predictive Control for Robotic Manipulation [EB/OL]. (2024-07-13) [2025-04-29]. https://arxiv.org/abs/2407.09829.
