ControlVLA: Few-shot Object-centric Adaptation for Pre-trained Vision-Language-Action Models
Learning real-world robotic manipulation is challenging, particularly when limited demonstrations are available. Existing methods for few-shot manipulation often rely on simulation-augmented data or pre-built modules like grasping and pose estimation, which struggle with sim-to-real gaps and lack extensibility. While large-scale imitation pre-training shows promise, adapting these general-purpose policies to specific tasks in data-scarce settings remains underexplored. To address this, we propose ControlVLA, a novel framework that bridges pre-trained VLA models with object-centric representations via a ControlNet-style architecture for efficient fine-tuning. Specifically, to introduce object-centric conditions without overwriting prior knowledge, ControlVLA zero-initializes a set of projection layers, allowing them to gradually adapt the pre-trained manipulation policies. In real-world experiments across 6 diverse tasks, including pouring cubes and folding clothes, our method achieves a 76.7% success rate while requiring only 10-20 demonstrations -- a significant improvement over traditional approaches that require more than 100 demonstrations to achieve comparable success. Additional experiments highlight ControlVLA's extensibility to long-horizon tasks and robustness to unseen objects and backgrounds.
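The abstract describes injecting object-centric conditions through zero-initialized projection layers so that fine-tuning does not overwrite the pre-trained policy. Below is a minimal PyTorch sketch of that ControlNet-style idea; the module name, feature dimensions, and residual-addition scheme are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of ControlNet-style zero-initialized conditioning:
# object-centric features are projected and added to the pre-trained
# policy's hidden features. Because the projection starts at zero, the
# policy's behavior is unchanged at the start of fine-tuning.
import torch
import torch.nn as nn

class ZeroInitProjection(nn.Module):
    """Projects object-centric features into the policy's hidden space.

    Weights and biases are zero-initialized, so the residual added to the
    pre-trained features is zero at first and grows only as training
    updates the projection.
    """
    def __init__(self, obj_dim: int, hidden_dim: int):
        super().__init__()
        self.proj = nn.Linear(obj_dim, hidden_dim)
        nn.init.zeros_(self.proj.weight)
        nn.init.zeros_(self.proj.bias)

    def forward(self, hidden: torch.Tensor, obj_feat: torch.Tensor) -> torch.Tensor:
        # At initialization this returns `hidden` unchanged, preserving the
        # pre-trained VLA policy; the object-centric condition is blended in
        # gradually as the zero-initialized weights are learned.
        return hidden + self.proj(obj_feat)

if __name__ == "__main__":
    # Example with made-up dimensions: before any training the output
    # equals the pre-trained hidden features exactly.
    layer = ZeroInitProjection(obj_dim=256, hidden_dim=1024)
    hidden = torch.randn(2, 1024)   # features from the pre-trained policy
    obj_feat = torch.randn(2, 256)  # object-centric representation
    out = layer(hidden, obj_feat)
    assert torch.allclose(out, hidden)
```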
Puhao Li, Yingying Wu, Ziheng Xi, Wanlin Li, Yuzhe Huang, Zhiyuan Zhang, Yinghan Chen, Jianan Wang, Song-Chun Zhu, Tengyu Liu, Siyuan Huang
Subjects: Automation Technology, Automation Equipment; Computing Technology, Computer Technology
Puhao Li, Yingying Wu, Ziheng Xi, Wanlin Li, Yuzhe Huang, Zhiyuan Zhang, Yinghan Chen, Jianan Wang, Song-Chun Zhu, Tengyu Liu, Siyuan Huang. ControlVLA: Few-shot Object-centric Adaptation for Pre-trained Vision-Language-Action Models [EB/OL]. (2025-06-19) [2025-07-09]. https://arxiv.org/abs/2506.16211