National Preprint Platform

Training-Free Efficient Video Generation via Dynamic Token Carving


Source: arXiv

Abstract

Despite the remarkable generation quality of video Diffusion Transformer (DiT) models, their practical deployment is severely hindered by extensive computational requirements. This inefficiency stems from two key challenges: the quadratic complexity of self-attention with respect to token length and the multi-step nature of diffusion models. To address these limitations, we present Jenga, a novel inference pipeline that combines dynamic attention carving with progressive resolution generation. Our approach leverages two key insights: (1) early denoising steps do not require high-resolution latents, and (2) later steps do not require dense attention. Jenga introduces a block-wise attention mechanism that dynamically selects relevant token interactions using 3D space-filling curves, alongside a progressive resolution strategy that gradually increases latent resolution during generation. Experimental results demonstrate that Jenga achieves substantial speedups across multiple state-of-the-art video diffusion models while maintaining comparable generation quality (8.83$\times$ speedup with 0.01\% performance drop on VBench). As a plug-and-play solution, Jenga enables practical, high-quality video generation on modern hardware by reducing inference time from minutes to seconds -- without requiring model retraining. Code: https://github.com/dvlab-research/Jenga
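The block-wise attention carving described above can be illustrated with a minimal sketch. The snippet below is an assumption-laden toy, not the paper's implementation: it uses a Morton (Z-order) curve as one example of a 3D space-filling curve over the (time, height, width) token grid, and a hypothetical `carve_blocks` helper that mean-pools each block and keeps only the top-k most similar key blocks per query block. Jenga's actual curve choice and block-selection criterion may differ.

```python
import numpy as np

def morton3d(t, h, w, bits=4):
    # Interleave the bits of (t, h, w) into a Z-order (Morton) index,
    # a simple 3D space-filling curve that keeps nearby tokens close.
    code = 0
    for i in range(bits):
        code |= ((t >> i) & 1) << (3 * i + 2)
        code |= ((h >> i) & 1) << (3 * i + 1)
        code |= ((w >> i) & 1) << (3 * i)
    return code

def carve_blocks(x, block=8, keep=4):
    """x: (N, d) tokens already ordered along the curve.
    For each query block, return indices of the `keep` most
    relevant key blocks, scored by mean-pooled dot product."""
    n_blocks = x.shape[0] // block
    pooled = x[: n_blocks * block].reshape(n_blocks, block, -1).mean(axis=1)
    sim = pooled @ pooled.T                    # block-level similarity
    return np.argsort(-sim, axis=1)[:, :keep]  # top-k key blocks per query block

# Order a small 4x4x4 latent token grid along the curve, then carve.
coords = [(t, h, w) for t in range(4) for h in range(4) for w in range(4)]
order = sorted(range(len(coords)), key=lambda i: morton3d(*coords[i]))
tokens = np.random.default_rng(0).standard_normal((64, 16))[order]
selected = carve_blocks(tokens, block=8, keep=4)
print(selected.shape)  # (8, 4): 4 key blocks kept per query block
```

Restricting each query block to a few curve-local key blocks is what turns the quadratic dense attention into a much sparser computation, while the curve ordering keeps spatially and temporally adjacent tokens in the same blocks.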

Yuechen Zhang, Jinbo Xing, Bin Xia, Shaoteng Liu, Bohao Peng, Xin Tao, Pengfei Wan, Eric Lo, Jiaya Jia

Subjects: Computing Technology; Computer Technology

Yuechen Zhang, Jinbo Xing, Bin Xia, Shaoteng Liu, Bohao Peng, Xin Tao, Pengfei Wan, Eric Lo, Jiaya Jia. Training-Free Efficient Video Generation via Dynamic Token Carving [EB/OL]. (2025-05-22) [2025-06-30]. https://arxiv.org/abs/2505.16864.
