SwitchVLA: Execution-Aware Task Switching for Vision-Language-Action Models
Robots deployed in dynamic environments must not only follow diverse language instructions but also flexibly adapt when user intent changes mid-execution. While recent Vision-Language-Action (VLA) models have advanced multi-task learning and instruction following, they typically assume static task intent and fail to respond when new instructions arrive during ongoing execution. This limitation hinders natural and robust interaction in dynamic settings, such as retail or household environments, where real-time intent changes are common. We propose SwitchVLA, a unified, execution-aware framework that enables smooth and reactive task switching without external planners or additional switch-specific data. We model task switching as a behavior modulation problem conditioned on execution state and instruction context. Expert demonstrations are segmented into temporally grounded contact phases, allowing the policy to infer task progress and adjust its behavior accordingly. A multi-behavior conditional policy is then trained to generate flexible action chunks under varying behavior modes through conditioned trajectory modeling. Experiments in both simulated and real-world robotic manipulation demonstrate that SwitchVLA enables robust instruction adherence, fluid task switching, and strong generalization, outperforming prior VLA baselines in both task success rate and interaction naturalness.
Meng Li, Zhen Zhao, Zhengping Che, Fei Liao, Kun Wu, Zhiyuan Xu, Pei Ren, Zhao Jin, Ning Liu, Jian Tang
Computing Technology, Computer Technology
Meng Li, Zhen Zhao, Zhengping Che, Fei Liao, Kun Wu, Zhiyuan Xu, Pei Ren, Zhao Jin, Ning Liu, Jian Tang. SwitchVLA: Execution-Aware Task Switching for Vision-Language-Action Models [EB/OL]. (2025-06-04) [2025-07-21]. https://arxiv.org/abs/2506.03574.