LoViC: Efficient Long Video Generation with Context Compression

Abstract

Despite recent advances in diffusion transformers (DiTs) for text-to-video generation, scaling to long-duration content remains challenging due to the quadratic complexity of self-attention. While prior efforts (such as sparse attention and temporally autoregressive models) offer partial relief, they often compromise temporal coherence or scalability. We introduce LoViC, a DiT-based framework trained on million-scale open-domain videos, designed to produce long, coherent videos through a segment-wise generation process. At the core of our approach is FlexFormer, an expressive autoencoder that jointly compresses video and text into unified latent representations. It supports variable-length inputs with linearly adjustable compression rates, enabled by a single query token design based on the Q-Former architecture. Additionally, by encoding temporal context through position-aware mechanisms, our model seamlessly supports prediction, retrodiction, interpolation, and multi-shot generation within a unified paradigm. Extensive experiments across diverse tasks validate the effectiveness and versatility of our approach.

Authors: Jiaxiu Jiang, Wenbo Li, Jingjing Ren, Yuping Qiu, Yong Guo, Xiaogang Xu, Han Wu, Wangmeng Zuo

Subject classification: Computing technology, computer technology

Recommended citation: Jiaxiu Jiang, Wenbo Li, Jingjing Ren, Yuping Qiu, Yong Guo, Xiaogang Xu, Han Wu, Wangmeng Zuo. LoViC: Efficient Long Video Generation with Context Compression [EB/OL]. (2025-07-17) [2025-08-10]. https://arxiv.org/abs/2507.12952.
