Bridging Perception and Action: Spatially-Grounded Mid-Level Representations for Robot Generalization
In this work, we investigate how spatially grounded auxiliary representations can provide both broad, high-level grounding and direct, actionable information to improve policy learning performance and generalization for dexterous tasks. We study these mid-level representations across three critical dimensions: object-centricity, pose-awareness, and depth-awareness. We use these interpretable mid-level representations to train specialist encoders via supervised learning, then feed them as inputs to a diffusion policy to solve dexterous bimanual manipulation tasks in the real world. We propose a novel mixture-of-experts policy architecture that combines multiple specialized expert models, each trained on a distinct mid-level representation, to improve policy generalization. This method achieves an average success rate that is 11% higher than a language-grounded baseline and 24% higher than a standard diffusion policy baseline on our evaluation tasks. Furthermore, we find that leveraging mid-level representations as supervision signals for policy actions within a weighted imitation learning algorithm improves the precision with which the policy follows these representations, yielding an additional performance increase of 10%. Our findings highlight the importance of grounding robot policies not only with broad perceptual tasks but also with more granular, actionable representations. For further information and videos, please visit https://mid-level-moe.github.io.
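The abstract describes specialist encoders, each trained on a distinct mid-level representation, whose features are gated together before conditioning a diffusion policy. The following is a minimal illustrative sketch of such a mixture-of-experts fusion, not the authors' released code; the module names, feature dimensions, and gating scheme are assumptions for exposition.

```python
# Illustrative sketch only: a minimal mixture-of-experts over specialist
# encoders, each consuming one mid-level representation (e.g. object-centric,
# pose, depth features). Dimensions and names are hypothetical.
import torch
import torch.nn as nn


class SpecialistEncoder(nn.Module):
    """Encodes one mid-level representation into a fixed-size feature."""

    def __init__(self, in_dim: int, feat_dim: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 256), nn.ReLU(), nn.Linear(256, feat_dim)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)


class MidLevelMoE(nn.Module):
    """Gates per-expert features and returns a fused policy conditioning vector."""

    def __init__(self, in_dims, feat_dim: int = 128):
        super().__init__()
        self.experts = nn.ModuleList(SpecialistEncoder(d, feat_dim) for d in in_dims)
        self.gate = nn.Linear(feat_dim * len(in_dims), len(in_dims))

    def forward(self, inputs):
        # One feature vector per mid-level representation.
        feats = [enc(x) for enc, x in zip(self.experts, inputs)]
        stacked = torch.stack(feats, dim=1)                            # (B, E, F)
        weights = torch.softmax(self.gate(torch.cat(feats, dim=-1)), dim=-1)
        return (weights.unsqueeze(-1) * stacked).sum(dim=1)            # (B, F)


# Usage: the fused feature would condition a downstream diffusion policy head.
moe = MidLevelMoE(in_dims=[64, 7, 32])  # e.g. object mask, pose, depth inputs
batch = [torch.randn(4, d) for d in [64, 7, 32]]
print(moe(batch).shape)  # torch.Size([4, 128])
```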
Jonathan Yang, Chuyuan Kelly Fu, Dhruv Shah, Dorsa Sadigh, Fei Xia, Tingnan Zhang
Automation technology; automation equipment
Jonathan Yang, Chuyuan Kelly Fu, Dhruv Shah, Dorsa Sadigh, Fei Xia, Tingnan Zhang. Bridging Perception and Action: Spatially-Grounded Mid-Level Representations for Robot Generalization [EB/OL]. (2025-06-06) [2025-07-16]. https://arxiv.org/abs/2506.06196.