National Preprint Platform

Training-Free Efficient Video Generation via Dynamic Token Carving


Source: arXiv

Abstract

Despite the remarkable generation quality of video Diffusion Transformer (DiT) models, their practical deployment is severely hindered by extensive computational requirements. This inefficiency stems from two key challenges: the quadratic complexity of self-attention with respect to token length and the multi-step nature of diffusion models. To address these limitations, we present Jenga, a novel inference pipeline that combines dynamic attention carving with progressive resolution generation. Our approach leverages two key insights: (1) early denoising steps do not require high-resolution latents, and (2) later steps do not require dense attention. Jenga introduces a block-wise attention mechanism that dynamically selects relevant token interactions using 3D space-filling curves, alongside a progressive resolution strategy that gradually increases latent resolution during generation. Experimental results demonstrate that Jenga achieves substantial speedups across multiple state-of-the-art video diffusion models while maintaining comparable generation quality (8.83$\times$ speedup with 0.01\% performance drop on VBench). As a plug-and-play solution, Jenga enables practical, high-quality video generation on modern hardware by reducing inference time from minutes to seconds -- without requiring model retraining. Code: https://github.com/dvlab-research/Jenga
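The block-wise attention carving described above can be illustrated with a minimal sketch. The snippet below is an assumption-laden toy, not the paper's implementation: it uses a Morton (Z-order) curve as one example of a 3D space-filling curve over the (time, height, width) token grid, and a hypothetical `carve_blocks` helper that mean-pools each block and keeps only the top-k most similar key blocks per query block. Jenga's actual curve choice and block-selection criterion may differ.

```python
import numpy as np

def morton3d(t, h, w, bits=4):
    # Interleave the bits of (t, h, w) into a Z-order (Morton) index,
    # a simple 3D space-filling curve that keeps nearby tokens close.
    code = 0
    for i in range(bits):
        code |= ((t >> i) & 1) << (3 * i + 2)
        code |= ((h >> i) & 1) << (3 * i + 1)
        code |= ((w >> i) & 1) << (3 * i)
    return code

def carve_blocks(x, block=8, keep=4):
    """x: (N, d) tokens already ordered along the curve.
    For each query block, return indices of the `keep` most
    relevant key blocks, scored by mean-pooled dot product."""
    n_blocks = x.shape[0] // block
    pooled = x[: n_blocks * block].reshape(n_blocks, block, -1).mean(axis=1)
    sim = pooled @ pooled.T                    # block-level similarity
    return np.argsort(-sim, axis=1)[:, :keep]  # top-k key blocks per query block

# Order a small 4x4x4 latent token grid along the curve, then carve.
coords = [(t, h, w) for t in range(4) for h in range(4) for w in range(4)]
order = sorted(range(len(coords)), key=lambda i: morton3d(*coords[i]))
tokens = np.random.default_rng(0).standard_normal((64, 16))[order]
selected = carve_blocks(tokens, block=8, keep=4)
print(selected.shape)  # (8, 4): 4 key blocks kept per query block
```

Restricting each query block to a few curve-local key blocks is what turns the quadratic dense attention into a much sparser computation, while the curve ordering keeps spatially and temporally adjacent tokens in the same blocks.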

Yuechen Zhang, Jinbo Xing, Bin Xia, Shaoteng Liu, Bohao Peng, Xin Tao, Pengfei Wan, Eric Lo, Jiaya Jia

Subjects: Computing Technology; Computer Technology

Yuechen Zhang, Jinbo Xing, Bin Xia, Shaoteng Liu, Bohao Peng, Xin Tao, Pengfei Wan, Eric Lo, Jiaya Jia. Training-Free Efficient Video Generation via Dynamic Token Carving [EB/OL]. (2025-05-22) [2025-06-30]. https://arxiv.org/abs/2505.16864.
