
Dense360: Dense Understanding from Omnidirectional Panoramas


Source: arXiv

Abstract

Multimodal Large Language Models (MLLMs) require comprehensive visual inputs to achieve dense understanding of the physical world. While existing MLLMs demonstrate impressive world understanding capabilities through limited field-of-view (FOV) visual inputs (e.g., 70 degrees), we take the first step toward dense understanding from omnidirectional panoramas. We first introduce an omnidirectional panorama dataset featuring a comprehensive suite of reliability-scored annotations. Specifically, our dataset contains 160K panoramas with 5M dense entity-level captions, 1M unique referring expressions, and 100K entity-grounded panoramic scene descriptions. Compared to multi-view alternatives, panoramas can provide more complete, compact, and continuous scene representations through equirectangular projection (ERP). However, the use of ERP introduces two key challenges for MLLMs: i) spatial continuity along the circle of latitude, and ii) latitude-dependent variation in information density. We address these challenges through ERP-RoPE, a position encoding scheme specifically designed for panoramic ERP. In addition, we introduce Dense360-Bench, the first benchmark for evaluating MLLMs on omnidirectional captioning and grounding, establishing a comprehensive framework for advancing dense visual-language understanding in panoramic settings.
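The abstract only names ERP-RoPE and does not describe its formulation. As a rough illustration of the wrap-around property it must handle, the sketch below (not the paper's method; the function name and the specific formulation are assumptions for illustration) shows a rotary-style position encoding whose longitude axis is exactly periodic, so tokens at the left and right borders of an ERP image receive identical positional codes.

import math
import torch

def periodic_longitude_rope(x: torch.Tensor, lon: torch.Tensor) -> torch.Tensor:
    """Rotate feature pairs by angles proportional to longitude.

    x:   (..., n_tokens, dim) token features, dim must be even.
    lon: (n_tokens,) token longitudes in radians, in [0, 2*pi).
    """
    half = x.shape[-1] // 2
    # Integer frequencies keep the rotation exactly 2*pi-periodic in longitude,
    # so the two horizontal borders of an ERP image get the same encoding.
    freqs = torch.arange(1, half + 1, dtype=x.dtype, device=x.device)
    angles = lon[:, None] * freqs[None, :]            # (n_tokens, half)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

# Wrap-around check: the same token at longitude 0 and at 2*pi is encoded
# identically, matching ERP's horizontal continuity.
tok = torch.randn(1, 8)
e0 = periodic_longitude_rope(tok, torch.tensor([0.0]))
e2pi = periodic_longitude_rope(tok, torch.tensor([2 * math.pi]))
assert torch.allclose(e0, e2pi, atol=1e-4)

The second ERP challenge named in the abstract, latitude-dependent information density, is not modeled in this sketch.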

Yikang Zhou, Tao Zhang, Dizhe Zhang, Shunping Ji, Xiangtai Li, Lu Qi

Subject: Computing technology; computer technology

Yikang Zhou, Tao Zhang, Dizhe Zhang, Shunping Ji, Xiangtai Li, Lu Qi. Dense360: Dense Understanding from Omnidirectional Panoramas [EB/OL]. (2025-06-17) [2025-06-30]. https://arxiv.org/abs/2506.14471.
