Geometry Forcing: Marrying Video Diffusion and 3D Representation for Consistent World Modeling

Abstract

Videos inherently represent 2D projections of a dynamic 3D world. However, our analysis suggests that video diffusion models trained solely on raw video data often fail to capture meaningful geometry-aware structure in their learned representations. To bridge this gap between video diffusion models and the underlying 3D nature of the physical world, we propose Geometry Forcing, a simple yet effective method that encourages video diffusion models to internalize latent 3D representations. Our key insight is to guide the model's intermediate representations toward geometry-aware structure by aligning them with features from a pretrained geometric foundation model. To this end, we introduce two complementary alignment objectives: Angular Alignment, which enforces directional consistency via cosine similarity, and Scale Alignment, which preserves scale-related information by regressing unnormalized geometric features from normalized diffusion representations. We evaluate Geometry Forcing on both camera view-conditioned and action-conditioned video generation tasks. Experimental results demonstrate that our method substantially improves visual quality and 3D consistency over the baseline methods. Project page: https://GeometryForcing.github.io.
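The abstract names the two alignment objectives but does not give their exact form. The following is a minimal NumPy sketch of how such losses could look, under stated assumptions: the diffusion and geometric features are already paired token by token and projected to the same dimension, the Scale Alignment regressor is a single linear layer, and the two terms are combined with a weight `lambda_scale`. All function names, shapes, and the weighting are hypothetical and are not taken from the paper.

```python
import numpy as np

def l2_normalize(x, eps=1e-8):
    # Scale each feature vector to (approximately) unit length along the last axis.
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def angular_alignment_loss(diff_feats, geo_feats):
    # Mean of (1 - cosine similarity) over paired feature vectors.
    # Only the direction of each vector matters, not its magnitude.
    cos = np.sum(l2_normalize(diff_feats) * l2_normalize(geo_feats), axis=-1)
    return float(np.mean(1.0 - cos))

def scale_alignment_loss(diff_feats, geo_feats, weight, bias):
    # Mean squared error between the unnormalized geometric features and a
    # prediction made from the normalized diffusion features.
    # Assumption: the regressor is one linear layer (weight, bias).
    pred = l2_normalize(diff_feats) @ weight + bias
    return float(np.mean((pred - geo_feats) ** 2))

# Hypothetical toy inputs: 16 tokens with 32-dimensional features.
rng = np.random.default_rng(0)
diff_feats = rng.normal(size=(16, 32))   # stand-in for diffusion features
geo_feats = rng.normal(size=(16, 32))    # stand-in for geometric model features
weight = rng.normal(size=(32, 32)) * 0.1
bias = np.zeros(32)

lambda_scale = 1.0  # assumed weighting between the two terms
total = angular_alignment_loss(diff_feats, geo_feats) \
    + lambda_scale * scale_alignment_loss(diff_feats, geo_feats, weight, bias)
print(f"combined alignment loss: {total:.4f}")
```

In this sketch the angular term is invariant to rescaling either input, which is why a separate regression term onto the unnormalized geometric features is needed to retain scale information.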

Authors: Haoyu Wu, Diankun Wu, Tianyu He, Junliang Guo, Yang Ye, Yueqi Duan, Jiang Bian

Subject classification: Computing technology; computer technology

Recommended citation: Haoyu Wu, Diankun Wu, Tianyu He, Junliang Guo, Yang Ye, Yueqi Duan, Jiang Bian. Geometry Forcing: Marrying Video Diffusion and 3D Representation for Consistent World Modeling [EB/OL]. (2025-07-10) [2025-07-18]. https://arxiv.org/abs/2507.07982.
