
Dense360: Dense Understanding from Omnidirectional Panoramas


Source: arXiv

Abstract

Multimodal Large Language Models (MLLMs) require comprehensive visual inputs to achieve dense understanding of the physical world. While existing MLLMs demonstrate impressive world understanding capabilities through limited field-of-view (FOV) visual inputs (e.g., 70 degrees), we take the first step toward dense understanding from omnidirectional panoramas. We first introduce an omnidirectional panorama dataset featuring a comprehensive suite of reliability-scored annotations. Specifically, our dataset contains 160K panoramas with 5M dense entity-level captions, 1M unique referring expressions, and 100K entity-grounded panoramic scene descriptions. Compared to multi-view alternatives, panoramas can provide more complete, compact, and continuous scene representations through equirectangular projection (ERP). However, the use of ERP introduces two key challenges for MLLMs: i) spatial continuity along the circle of latitude, and ii) latitude-dependent variation in information density. We address these challenges through ERP-RoPE, a position encoding scheme specifically designed for panoramic ERP. In addition, we introduce Dense360-Bench, the first benchmark for evaluating MLLMs on omnidirectional captioning and grounding, establishing a comprehensive framework for advancing dense visual-language understanding in panoramic settings.
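The abstract only names ERP-RoPE and does not describe its formulation. As a rough illustration of the wrap-around property it must handle, the sketch below (not the paper's method; the function name and the specific formulation are assumptions for illustration) shows a rotary-style position encoding whose longitude axis is exactly periodic, so tokens at the left and right borders of an ERP image receive identical positional codes.

import math
import torch

def periodic_longitude_rope(x: torch.Tensor, lon: torch.Tensor) -> torch.Tensor:
    """Rotate feature pairs by angles proportional to longitude.

    x:   (..., n_tokens, dim) token features, dim must be even.
    lon: (n_tokens,) token longitudes in radians, in [0, 2*pi).
    """
    half = x.shape[-1] // 2
    # Integer frequencies keep the rotation exactly 2*pi-periodic in longitude,
    # so the two horizontal borders of an ERP image get the same encoding.
    freqs = torch.arange(1, half + 1, dtype=x.dtype, device=x.device)
    angles = lon[:, None] * freqs[None, :]            # (n_tokens, half)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

# Wrap-around check: the same token at longitude 0 and at 2*pi is encoded
# identically, matching ERP's horizontal continuity.
tok = torch.randn(1, 8)
e0 = periodic_longitude_rope(tok, torch.tensor([0.0]))
e2pi = periodic_longitude_rope(tok, torch.tensor([2 * math.pi]))
assert torch.allclose(e0, e2pi, atol=1e-4)

The second ERP challenge named in the abstract, latitude-dependent information density, is not modeled in this sketch.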

Yikang Zhou, Tao Zhang, Dizhe Zhang, Shunping Ji, Xiangtai Li, Lu Qi

Subject: Computing technology; computer technology

Yikang Zhou, Tao Zhang, Dizhe Zhang, Shunping Ji, Xiangtai Li, Lu Qi. Dense360: Dense Understanding from Omnidirectional Panoramas [EB/OL]. (2025-06-17) [2025-06-30]. https://arxiv.org/abs/2506.14471.
