Ross3D: Reconstructive Visual Instruction Tuning with 3D-Awareness
The rapid development of Large Multimodal Models (LMMs) for 2D images and videos has spurred efforts to adapt these models for interpreting 3D scenes. However, the absence of large-scale 3D vision-language datasets has posed a significant obstacle. To address this issue, typical approaches focus on injecting 3D awareness into 2D LMMs by designing 3D input-level scene representations. This work provides a new perspective. We introduce reconstructive visual instruction tuning with 3D-awareness (Ross3D), which integrates 3D-aware visual supervision into the training procedure. Specifically, it incorporates cross-view and global-view reconstruction. The former requires reconstructing masked views by aggregating overlapping information from other views. The latter aims to aggregate information from all available views to recover Bird's-Eye-View images, contributing to a comprehensive overview of the entire scene. Empirically, Ross3D achieves state-of-the-art performance across various 3D scene understanding benchmarks. More importantly, our semi-supervised experiments demonstrate significant potential in leveraging large amounts of unlabeled 3D vision-only data.
Haochen Wang, Yucheng Zhao, Tiancai Wang, Haoqiang Fan, Xiangyu Zhang, Zhaoxiang Zhang
Computing technology, computer technology
Haochen Wang, Yucheng Zhao, Tiancai Wang, Haoqiang Fan, Xiangyu Zhang, Zhaoxiang Zhang. Ross3D: Reconstructive Visual Instruction Tuning with 3D-Awareness[EB/OL]. (2025-04-02)[2025-05-07]. https://arxiv.org/abs/2504.01901.