HunyuanVideo-HOMA: Generic Human-Object Interaction in Multimodal Driven Human Animation
To address key limitations in human-object interaction (HOI) video generation -- specifically the reliance on curated motion data, limited generalization to novel objects/scenarios, and restricted accessibility -- we introduce HunyuanVideo-HOMA, a weakly conditioned multimodal-driven framework. HunyuanVideo-HOMA enhances controllability and reduces dependency on precise inputs through sparse, decoupled motion guidance. It encodes appearance and motion signals into the dual input space of a multimodal diffusion transformer (MMDiT), fusing them within a shared context space to synthesize temporally consistent and physically plausible interactions. To optimize training, we integrate a parameter-space HOI adapter initialized from pretrained MMDiT weights, preserving prior knowledge while enabling efficient adaptation, and a facial cross-attention adapter for anatomically accurate audio-driven lip synchronization. Extensive experiments confirm state-of-the-art performance in interaction naturalness and generalization under weak supervision. Finally, HunyuanVideo-HOMA demonstrates versatility in text-conditioned generation and interactive object manipulation, supported by a user-friendly demo interface. The project page is at https://anonymous.4open.science/w/homa-page-0FBE/.
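The paper does not publish reference code in this abstract, but the architecture it describes (dual-stream tokens fused in a shared context space, plus a parameter-space adapter cloned from pretrained MMDiT attention weights) can be illustrated with a minimal PyTorch sketch. Everything below is an assumption for illustration: the class name HOIAdapterBlock, the init_adapter_from_pretrained helper, and the residual fusion are hypothetical, not the authors' implementation.

import torch
import torch.nn as nn

class HOIAdapterBlock(nn.Module):
    """Hypothetical sketch of one MMDiT-style block: appearance and motion
    tokens are normalized, concatenated into a shared context sequence, and
    fused with joint self-attention. An adapter attention branch with the
    same shape is cloned from the pretrained weights, echoing the abstract's
    "parameter-space HOI adapter initialized from pretrained MMDiT weights"."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.norm_app = nn.LayerNorm(dim)   # appearance-stream norm
        self.norm_mot = nn.LayerNorm(dim)   # motion-stream norm
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Adapter branch: identical shape so pretrained weights can be copied in.
        self.adapter_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    @torch.no_grad()
    def init_adapter_from_pretrained(self):
        # Assumption: "initialized from pretrained MMDiT weights" is read as
        # cloning the base attention into the adapter, then freezing the base
        # so only the adapter is fine-tuned.
        self.adapter_attn.load_state_dict(self.attn.state_dict())
        for p in self.attn.parameters():
            p.requires_grad_(False)

    def forward(self, app_tokens: torch.Tensor, mot_tokens: torch.Tensor):
        # Fuse both streams in one shared context sequence (joint attention).
        ctx = torch.cat([self.norm_app(app_tokens), self.norm_mot(mot_tokens)], dim=1)
        base, _ = self.attn(ctx, ctx, ctx)
        delta, _ = self.adapter_attn(ctx, ctx, ctx)
        fused = ctx + base + delta           # residual fusion of base + adapter
        n_app = app_tokens.shape[1]
        return fused[:, :n_app], fused[:, n_app:]  # split streams back out

# Usage sketch: batch of 2, 16 appearance tokens, 24 motion tokens, dim 64.
block = HOIAdapterBlock(dim=64)
block.init_adapter_from_pretrained()
app, mot = block(torch.randn(2, 16, 64), torch.randn(2, 24, 64))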
Ziyao Huang, Zixiang Zhou, Juan Cao, Yifeng Ma, Yi Chen, Zejing Rao, Zhiyong Xu, Hongmei Wang, Qin Lin, Yuan Zhou, Qinglin Lu, Fan Tang
Subjects: Computing Technology; Computer Technology
Ziyao Huang, Zixiang Zhou, Juan Cao, Yifeng Ma, Yi Chen, Zejing Rao, Zhiyong Xu, Hongmei Wang, Qin Lin, Yuan Zhou, Qinglin Lu, Fan Tang. HunyuanVideo-HOMA: Generic Human-Object Interaction in Multimodal Driven Human Animation [EB/OL]. (2025-06-10) [2025-07-16]. https://arxiv.org/abs/2506.08797.