Video-GPT via Next Clip Diffusion
GPT has shown remarkable success in natural language processing. However, language sequences alone are insufficient to describe the spatial-temporal details of the visual world, whereas video sequences capture such details well. Motivated by this, we propose a concise Video-GPT in this paper by treating video as a new language for visual world modeling. By analogy to next token prediction in GPT, we introduce a novel next clip diffusion paradigm for pretraining Video-GPT. Unlike previous works, this paradigm allows Video-GPT to tackle both short-term generation and long-term prediction by autoregressively denoising a noisy clip conditioned on the clean clips in its history. Extensive experiments show that our Video-GPT achieves state-of-the-art performance on video prediction, a key factor for world modeling (Physics-IQ Benchmark: Video-GPT 34.97 vs. Kling 23.64 vs. Wan 20.89). Moreover, it adapts well to 6 mainstream video tasks spanning video generation and understanding, demonstrating strong generalization on downstream tasks. The project page is at https://zhuangshaobin.github.io/Video-GPT.github.io/.
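To make the paradigm concrete, below is a minimal, illustrative sketch of the next clip diffusion idea described in the abstract: a denoiser predicts the noise in a noisy future clip while conditioning on the clean clips already generated, and long-term prediction chains this step autoregressively. This is not the authors' implementation; names such as `denoiser`, `sample_next_clip`, and `rollout`, the DDPM schedule, and the tensor layout are all assumed placeholders.

```python
# Illustrative sketch only (assumed names and shapes), not the paper's code.
import torch

def make_ddpm_schedule(num_steps=50, beta_start=1e-4, beta_end=2e-2):
    # Standard linear-beta DDPM schedule, chosen here for simplicity.
    betas = torch.linspace(beta_start, beta_end, num_steps)
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)
    return betas, alphas, alpha_bars

@torch.no_grad()
def sample_next_clip(denoiser, history, clip_shape, num_steps=50):
    """Denoise one future clip, conditioned on the clean history clips.

    `denoiser(x_t, t, history)` is assumed to return a noise prediction.
    """
    betas, alphas, alpha_bars = make_ddpm_schedule(num_steps)
    x = torch.randn(clip_shape)                        # start from pure noise
    for i in reversed(range(num_steps)):
        t = torch.full((clip_shape[0],), i, dtype=torch.long)
        eps = denoiser(x, t, history)                  # history-conditioned noise prediction
        coef = betas[i] / torch.sqrt(1.0 - alpha_bars[i])
        mean = (x - coef * eps) / torch.sqrt(alphas[i])
        noise = torch.randn_like(x) if i > 0 else torch.zeros_like(x)
        x = mean + torch.sqrt(betas[i]) * noise        # DDPM reverse update
    return x

@torch.no_grad()
def rollout(denoiser, init_clips, clip_shape, horizon=4):
    """Long-term prediction: append each newly denoised clip to the history."""
    clips = list(init_clips)
    for _ in range(horizon):
        history = torch.stack(clips, dim=1)            # (B, num_clips, C, T, H, W)
        clips.append(sample_next_clip(denoiser, history, clip_shape))
    return clips
```

The GPT analogy is in `rollout`: each newly generated clip becomes clean context for the next denoising pass, mirroring next token prediction at the clip level.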
Shaobin Zhuang, Zhipeng Huang, Ying Zhang, Fangyikang Wang, Canmiao Fu, Binxin Yang, Chong Sun, Chen Li, Yali Wang
Subject areas: Information Science and Information Technology; Computing Technology and Computer Technology
Shaobin Zhuang, Zhipeng Huang, Ying Zhang, Fangyikang Wang, Canmiao Fu, Binxin Yang, Chong Sun, Chen Li, Yali Wang. Video-GPT via Next Clip Diffusion [EB/OL]. (2025-05-18) [2025-06-04]. https://arxiv.org/abs/2505.12489.