Dynamic-I2V: Exploring Image-to-Video Generation Models via Multimodal LLM
Recent advancements in image-to-video (I2V) generation have shown promising performance in conventional scenarios. However, these methods still encounter significant challenges when dealing with complex scenes that require a deep understanding of nuanced motion and intricate object-action relationships. To address these challenges, we present Dynamic-I2V, an innovative framework that integrates Multimodal Large Language Models (MLLMs) to jointly encode visual and textual conditions for a diffusion transformer (DiT) architecture. By leveraging the advanced multimodal understanding capabilities of MLLMs, our model significantly improves motion controllability and temporal coherence in synthesized videos. The inherent multimodality of Dynamic-I2V further enables flexible support for diverse conditional inputs, extending its applicability to various downstream generation tasks. Through systematic analysis, we identify a critical limitation in current I2V benchmarks: a significant bias toward low-dynamic videos, stemming from an inadequate balance between motion complexity and visual quality metrics. To resolve this evaluation gap, we propose DIVE, a novel assessment benchmark specifically designed for comprehensive dynamic quality measurement in I2V generation. Extensive quantitative and qualitative experiments confirm that Dynamic-I2V attains state-of-the-art performance in image-to-video generation, with improvements of 42.5%, 7.9%, and 11.8% in dynamic range, controllability, and quality, respectively, over existing methods as assessed by DIVE.
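The abstract describes the core architectural idea at a high level: an MLLM jointly encodes the conditioning image and text, and the resulting tokens condition a video DiT. The paper's actual implementation is not given here; the following is a minimal PyTorch sketch of that general pattern, where all module and parameter names (JointConditionEncoder, DiTBlock, the dimensions, the cross-attention conditioning scheme) are illustrative assumptions rather than the authors' code.

```python
# Hypothetical sketch: joint image+text conditioning for a video DiT block.
# All names, shapes, and design choices are illustrative assumptions.
import torch
import torch.nn as nn

class JointConditionEncoder(nn.Module):
    """Stands in for an MLLM that fuses visual and textual features
    into a single sequence of condition tokens."""
    def __init__(self, img_dim=1024, txt_dim=768, cond_dim=1024):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, cond_dim)
        self.txt_proj = nn.Linear(txt_dim, cond_dim)
        self.fuse = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(cond_dim, nhead=8, batch_first=True),
            num_layers=2,
        )

    def forward(self, img_feats, txt_feats):
        # img_feats: (B, N_img, img_dim); txt_feats: (B, N_txt, txt_dim)
        tokens = torch.cat(
            [self.img_proj(img_feats), self.txt_proj(txt_feats)], dim=1
        )
        return self.fuse(tokens)  # (B, N_img + N_txt, cond_dim)

class DiTBlock(nn.Module):
    """One transformer block of the video DiT; it attends to the fused
    condition tokens via cross-attention."""
    def __init__(self, dim=1024, nhead=8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, nhead, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, nhead, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )
        self.norm1, self.norm2, self.norm3 = (nn.LayerNorm(dim) for _ in range(3))

    def forward(self, x, cond):
        # Self-attention over video latent tokens.
        h = self.norm1(x)
        x = x + self.self_attn(h, h, h)[0]
        # Cross-attention injects the joint image+text condition.
        x = x + self.cross_attn(self.norm2(x), cond, cond)[0]
        return x + self.mlp(self.norm3(x))

# Toy usage: 16 video latent tokens conditioned on image + text features.
enc, block = JointConditionEncoder(), DiTBlock()
cond = enc(torch.randn(2, 32, 1024), torch.randn(2, 16, 768))
out = block(torch.randn(2, 16, 1024), cond)
print(out.shape)  # torch.Size([2, 16, 1024])
```

The design point the abstract emphasizes is that visual and textual conditions are encoded jointly (here, fused into one token sequence before conditioning) rather than injected through separate pathways, which is what lets a single model flexibly accept diverse conditional inputs.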
Peng Liu, Xiaoming Ren, Fengkai Liu, Qingsong Xie, Quanlong Zheng, Yanhao Zhang, Haonan Lu, Yujiu Yang
Computing Technology, Computer Technology
Peng Liu, Xiaoming Ren, Fengkai Liu, Qingsong Xie, Quanlong Zheng, Yanhao Zhang, Haonan Lu, Yujiu Yang. Dynamic-I2V: Exploring Image-to-Video Generation Models via Multimodal LLM [EB/OL]. (2025-05-26) [2025-06-27]. https://arxiv.org/abs/2505.19901.