Towards Embodied Cognition in Robots via Spatially Grounded Synthetic Worlds
We present a conceptual framework for training Vision-Language Models (VLMs) to perform Visual Perspective Taking (VPT), a core capability of embodied cognition that is essential for Human-Robot Interaction (HRI). As a first step toward this goal, we introduce a synthetic dataset, generated in NVIDIA Omniverse, that enables supervised learning for spatial reasoning tasks. Each instance includes an RGB image, a natural language description, and a ground-truth 4×4 transformation matrix representing object pose. We focus on inferring Z-axis distance as a foundational skill, with future extensions targeting full 6 degrees-of-freedom (DoF) reasoning. The dataset is publicly available to support further research. This work serves as a foundational step toward embodied AI systems capable of spatial understanding in interactive human-robot scenarios.
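To illustrate the dataset structure described in the abstract, the following is a minimal sketch of how a single instance could be represented in Python. The field names (rgb_image, description, object_pose) and the z_distance helper are hypothetical and assumed for illustration only; the abstract specifies only that each instance contains an RGB image, a natural-language description, and a 4×4 ground-truth pose matrix.

    # Minimal sketch of one dataset instance as described in the abstract.
    # Field names and the distance helper are illustrative assumptions,
    # not the released dataset's actual schema.
    from dataclasses import dataclass
    import numpy as np

    @dataclass
    class VPTInstance:
        rgb_image: np.ndarray    # H x W x 3 rendered RGB frame (e.g., from Omniverse)
        description: str         # natural-language description of the scene
        object_pose: np.ndarray  # 4x4 homogeneous transform (rotation + translation)

        def z_distance(self) -> float:
            # Z-axis distance is the third component of the translation column.
            return float(self.object_pose[2, 3])

    # Example usage with a synthetic pose (hypothetical values):
    pose = np.eye(4)
    pose[:3, 3] = [0.2, -0.1, 1.5]   # object 1.5 m along the camera Z-axis
    inst = VPTInstance(
        rgb_image=np.zeros((480, 640, 3), dtype=np.uint8),
        description="a red cube on the table, 1.5 m in front of the robot",
        object_pose=pose,
    )
    print(inst.z_distance())         # -> 1.5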
Joel Currie, Gioele Migno, Enrico Piacenti, Maria Elena Giannaccini, Patric Bach, Davide De Tommaso, Agnieszka Wykowska
Computing Technology; Computer Technology
Joel Currie, Gioele Migno, Enrico Piacenti, Maria Elena Giannaccini, Patric Bach, Davide De Tommaso, Agnieszka Wykowska. Towards Embodied Cognition in Robots via Spatially Grounded Synthetic Worlds [EB/OL]. (2025-05-20) [2025-07-17]. https://arxiv.org/abs/2505.14366.