OminiControl: Minimal and Universal Control for Diffusion Transformer
We present OminiControl, a novel approach that rethinks how image conditions are integrated into Diffusion Transformer (DiT) architectures. Current image conditioning methods either introduce substantial parameter overhead or handle only specific control tasks effectively, limiting their practical versatility. OminiControl addresses these limitations through three key innovations: (1) a minimal architectural design that leverages the DiT's own VAE encoder and transformer blocks, requiring just 0.1% additional parameters; (2) a unified sequence processing strategy that combines condition tokens with image tokens for flexible token interactions; and (3) a dynamic position encoding mechanism that adapts to both spatially-aligned and non-aligned control tasks. Our extensive experiments show that this streamlined approach not only matches but surpasses the performance of specialized methods across multiple conditioning tasks. To overcome data limitations in subject-driven generation, we also introduce Subjects200K, a large-scale dataset of identity-consistent image pairs synthesized using DiT models themselves. This work demonstrates that effective image control can be achieved without architectural complexity, opening new possibilities for efficient and versatile image generation systems.
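The unified sequence processing and dynamic position encoding described above can be illustrated with a minimal sketch. It is not the authors' implementation. The abstract does not specify how positions are assigned, so the rule used here (condition tokens reuse the image grid positions for spatially-aligned tasks, and are shifted by one image width for non-aligned tasks) is an assumption, as are the hypothetical helper names `grid_positions` and `build_unified_sequence`.

```python
import numpy as np


def grid_positions(height, width, offset=(0, 0)):
    """Return a (height*width, 2) array of (row, col) indices shifted by `offset`."""
    rows, cols = np.meshgrid(np.arange(height), np.arange(width), indexing="ij")
    pos = np.stack([rows.ravel(), cols.ravel()], axis=-1)
    return pos + np.asarray(offset)


def build_unified_sequence(image_tokens, cond_tokens, height, width, aligned):
    """Concatenate image and condition tokens and assign position indices.

    Assumption (not stated in the abstract): aligned conditions share the
    image token positions; non-aligned conditions are shifted by the image
    width so their positions do not overlap the image grid.
    """
    n = height * width
    assert image_tokens.shape[0] == n and cond_tokens.shape[0] == n
    # One joint sequence, so attention can mix image and condition tokens.
    tokens = np.concatenate([image_tokens, cond_tokens], axis=0)
    image_pos = grid_positions(height, width)
    cond_offset = (0, 0) if aligned else (0, width)
    cond_pos = grid_positions(height, width, cond_offset)
    positions = np.concatenate([image_pos, cond_pos], axis=0)
    return tokens, positions
```

In a real DiT the tokens would come from the VAE encoder and the position indices would feed the model's positional encoding; here random or constant arrays are enough to inspect the resulting shapes and indices.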
Xinchao Wang, Xingyi Yang, Zhenxiong Tan, Qiaochu Xue, Songhua Liu
Computing technology, computer technology
Xinchao Wang, Xingyi Yang, Zhenxiong Tan, Qiaochu Xue, Songhua Liu. OminiControl: Minimal and Universal Control for Diffusion Transformer [EB/OL]. (2024-11-22) [2025-05-08]. https://arxiv.org/abs/2411.15098.