Deeper Inside Deep ViT

Abstract

There have been attempts to build large-scale architectures for vision models similar to those of LLMs, such as ViT-22B. While that research has provided numerous analyses and insights, our understanding of its practical utility remains incomplete. We therefore examine how this model structure behaves and trains in a local environment. We also highlight instability in training and make some modifications to the model to stabilize it. Trained from scratch, the ViT-22B model outperformed ViT overall at the same parameter size. Additionally, we venture into image generation, a task that has not been attempted with ViT-22B. We propose an image generation architecture using ViT and investigate which of ViT and ViT-22B is the more suitable structure for image generation.

Author: Sungrae Hong

Subject classification: Computing technology, computer technology

Recommended citation: Sungrae Hong. Deeper Inside Deep ViT [EB/OL]. (2025-08-06) [2025-08-17]. https://arxiv.org/abs/2508.04181.
