Pts3D-LLM: Studying the Impact of Token Structure for 3D Scene Understanding With Large Language Models

Source: arXiv
Abstract

Effectively representing 3D scenes for Multimodal Large Language Models (MLLMs) is crucial yet challenging. Existing approaches commonly rely only on 2D image features and use varied tokenization schemes. This work presents a rigorous study of 3D token structures, systematically comparing video-based and point-based representations while keeping model backbones and parameters consistent. We propose a novel approach that enriches visual tokens by incorporating 3D point cloud features from a Sonata-pretrained Point Transformer V3 encoder. Our experiments demonstrate that merging explicit 3D features significantly boosts performance. Furthermore, we show that point-based token structures can rival video-based ones when the points are cleverly sampled and ordered. Our best models from both structures achieve state-of-the-art results on multiple 3D understanding benchmarks. We emphasize our analysis of token structures as a key contribution, alongside transparent reporting of results averaged over multiple seeds, a practice we believe is vital for robust progress in the field.

Hugues Thomas, Chen Chen, Jian Zhang

Computing technology, computer technology

Hugues Thomas, Chen Chen, Jian Zhang. Pts3D-LLM: Studying the Impact of Token Structure for 3D Scene Understanding With Large Language Models [EB/OL]. (2025-06-05) [2025-06-16]. https://arxiv.org/abs/2506.05689.
