NVSMask3D: Hard Visual Prompting with Camera Pose Interpolation for 3D Open Vocabulary Instance Segmentation
NVSMask3D: Hard Visual Prompting with Camera Pose Interpolation for 3D Open Vocabulary Instance Segmentation
Vision-language models (VLMs) have demonstrated impressive zero-shot transfer capabilities in image-level visual perception tasks. However, they fall short in 3D instance-level segmentation tasks that require accurate localization and recognition of individual objects. To bridge this gap, we introduce a novel 3D Gaussian Splatting based hard visual prompting approach that leverages camera interpolation to generate diverse viewpoints around target objects without any 2D-3D optimization or fine-tuning. Our method simulates realistic 3D perspectives, effectively augmenting existing hard visual prompts by enforcing geometric consistency across viewpoints. This training-free strategy seamlessly integrates with prior hard visual prompts, enriching object-descriptive features and enabling VLMs to achieve more robust and accurate 3D instance segmentation in diverse 3D scenes.
Junyuan Fang、Zihan Wang、Yejun Zhang、Shuzhe Wang、Iaroslav Melekhov、Juho Kannala
Aalto University, Espoo, FinlandAalto University, Espoo, FinlandAalto University, Espoo, FinlandAalto University, Espoo, FinlandAalto University, Espoo, FinlandAalto University, Espoo, Finland
计算技术、计算机技术
Junyuan Fang,Zihan Wang,Yejun Zhang,Shuzhe Wang,Iaroslav Melekhov,Juho Kannala.NVSMask3D: Hard Visual Prompting with Camera Pose Interpolation for 3D Open Vocabulary Instance Segmentation[EB/OL].(2025-04-20)[2025-05-26].https://arxiv.org/abs/2504.14638.点此复制
评论