
Atlas: Multi-Scale Attention Improves Long Context Image Modeling

Source: arXiv

Abstract

Efficiently modeling massive images is a long-standing challenge in machine learning. To this end, we introduce Multi-Scale Attention (MSA). MSA relies on two key ideas: (i) multi-scale representations and (ii) bi-directional cross-scale communication. MSA creates O(log N) scales to represent the image across progressively coarser features and leverages cross-attention to propagate information across scales. We then introduce Atlas, a novel neural network architecture based on MSA. We demonstrate that Atlas significantly improves the compute-performance tradeoff of long-context image modeling in a high-resolution variant of ImageNet 100. At 1024px resolution, Atlas-B achieves 91.04% accuracy, comparable to ConvNext-B (91.92%), while being 4.3x faster. Atlas is 2.95x faster and 7.38% better than FasterViT, and 2.25x faster and 4.96% better than LongViT. In comparisons against MambaVision-S, we find Atlas-S achieves 5%, 16% and 32% higher accuracy at 1024px, 2048px and 4096px respectively, while obtaining similar runtimes. Code for reproducing our experiments and pretrained models is available at https://github.com/yalalab/atlas.
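The abstract names the two MSA ideas but this listing carries no implementation detail, so the PyTorch snippet below is only a minimal sketch of those stated ideas, not the authors' released Atlas code: the module names, the 2x average pooling used to build coarser scales, the shared cross-attention weights, and the single bottom-up/top-down pass are all assumptions made for illustration.

```python
# Hypothetical sketch only: module names, 2x average pooling, shared
# cross-attention weights, and the single up/down pass are assumptions,
# not the released Atlas implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class CrossScaleAttention(nn.Module):
    """One cross-attention hop: tokens in `x` attend to tokens in `ctx`."""

    def __init__(self, dim, num_heads=4):
        super().__init__()
        self.norm_q = nn.LayerNorm(dim)
        self.norm_kv = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x, ctx):
        q, kv = self.norm_q(x), self.norm_kv(ctx)
        out, _ = self.attn(q, kv, kv, need_weights=False)
        return x + out  # residual connection


class MultiScaleAttentionSketch(nn.Module):
    """Builds O(log N) scales by repeated 2x pooling, then propagates
    information bottom-up and top-down between adjacent scales."""

    def __init__(self, dim, num_heads=4, min_tokens=4):
        super().__init__()
        self.min_tokens = min_tokens
        self.up = CrossScaleAttention(dim, num_heads)    # coarse queries fine
        self.down = CrossScaleAttention(dim, num_heads)  # fine queries coarse

    def forward(self, tokens):  # tokens: (B, N, dim)
        # Each level halves the token count, so N tokens yield ~log2(N) scales.
        scales = [tokens]
        while scales[-1].shape[1] > self.min_tokens:
            pooled = F.avg_pool1d(scales[-1].transpose(1, 2), 2).transpose(1, 2)
            scales.append(pooled)
        # Bottom-up: each coarser scale aggregates detail from the finer one.
        for i in range(1, len(scales)):
            scales[i] = self.up(scales[i], scales[i - 1])
        # Top-down: each finer scale receives global context back.
        for i in range(len(scales) - 2, -1, -1):
            scales[i] = self.down(scales[i], scales[i + 1])
        return scales[0]  # refined full-resolution tokens


x = torch.randn(2, 1024, 256)                  # 1024 tokens, width 256
y = MultiScaleAttentionSketch(dim=256)(x)
print(y.shape)                                 # torch.Size([2, 1024, 256])
```

Note that this sketch uses dense cross-attention between adjacent scales, so it does not reproduce the paper's efficiency claims; it only illustrates the bi-directional cross-scale information flow over a logarithmic number of scales (1024 → 512 → ... → 4 here).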

Boyi Li, Alexander Bick, Trevor Darrell, Long Lian, Longchao Liu, Kumar Krishna Agrawal, Maggie Chung, Natalia Harguindeguy, Adam Yala

Subjects: Computing Technology; Computer Technology

Boyi Li, Alexander Bick, Trevor Darrell, Long Lian, Longchao Liu, Kumar Krishna Agrawal, Maggie Chung, Natalia Harguindeguy, Adam Yala. Atlas: Multi-Scale Attention Improves Long Context Image Modeling [EB/OL]. (2025-03-16) [2025-04-26]. https://arxiv.org/abs/2503.12355.
