All in One: Visual-Description-Guided Unified Point Cloud Segmentation
Unified segmentation of 3D point clouds is crucial for scene understanding, but is hindered by the sparse structure of point clouds, limited annotations, and the challenge of distinguishing fine-grained object classes in complex environments. Existing methods often struggle to capture rich semantic and contextual information due to limited supervision and a lack of diverse multimodal cues, leading to suboptimal differentiation of classes and instances. To address these challenges, we propose VDG-Uni3DSeg, a novel framework that integrates pre-trained vision-language models (e.g., CLIP) and large language models (LLMs) to enhance 3D segmentation. By leveraging LLM-generated textual descriptions and reference images from the internet, our method incorporates rich multimodal cues, facilitating fine-grained class and instance separation. We further design a Semantic-Visual Contrastive Loss to align point features with multimodal queries and a Spatial Enhanced Module to model scene-wide relationships efficiently. Operating within a closed-set paradigm that utilizes multimodal knowledge generated offline, VDG-Uni3DSeg achieves state-of-the-art results in semantic, instance, and panoptic segmentation, offering a scalable and practical solution for 3D understanding. Our code is available at https://github.com/Hanzy1996/VDG-Uni3DSeg.
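To make the idea of aligning point features with multimodal queries concrete, the sketch below shows a minimal InfoNCE-style contrastive loss. It is an illustration only, not the authors' released implementation: the names (point_feats, query_embeds, labels), the temperature value, and the assumption that each class has a single query embedding (e.g., averaged CLIP text/image features) are all hypothetical choices for this example.

```python
# Minimal sketch of a contrastive alignment between per-point features and
# per-class multimodal query embeddings (hypothetical names and shapes).
import torch
import torch.nn.functional as F

def semantic_visual_contrastive_loss(point_feats, query_embeds, labels, temperature=0.07):
    """
    point_feats:  (N, D) per-point features from a 3D backbone
    query_embeds: (C, D) one query per class, e.g. averaged CLIP text/image embeddings
    labels:       (N,)   ground-truth class index per point
    """
    # Cosine-similarity logits between every point and every class query.
    point_feats = F.normalize(point_feats, dim=-1)
    query_embeds = F.normalize(query_embeds, dim=-1)
    logits = point_feats @ query_embeds.t() / temperature  # (N, C)
    # Cross-entropy over classes is InfoNCE with one positive query per point:
    # each point is pulled toward its class query and pushed from the others.
    return F.cross_entropy(logits, labels)

# Toy usage: 1024 points, 64-dim features, 20 classes.
if __name__ == "__main__":
    feats = torch.randn(1024, 64)
    queries = torch.randn(20, 64)
    gt = torch.randint(0, 20, (1024,))
    print(semantic_visual_contrastive_loss(feats, queries, gt))
```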
Zongyan Han, Mohamed El Amine Boudjoghra, Jiahua Dong, Jinhong Wang, Rao Muhammad Anwer
Computing technology; Computer technology
Zongyan Han, Mohamed El Amine Boudjoghra, Jiahua Dong, Jinhong Wang, Rao Muhammad Anwer. All in One: Visual-Description-Guided Unified Point Cloud Segmentation [EB/OL]. (2025-07-07) [2025-07-25]. https://arxiv.org/abs/2507.05211