Masked Temporal Interpolation Diffusion for Procedure Planning in Instructional Videos

Source: arXiv
Abstract

In this paper, we address procedure planning in instructional videos: generating coherent, task-aligned action sequences from start and end visual observations. Previous work has mainly relied on text-level supervision to bridge the gap between observed states and unobserved actions, but it struggles to capture intricate temporal relationships among actions. Building on these efforts, we propose the Masked Temporal Interpolation Diffusion (MTID) model, which introduces a latent-space temporal interpolation module within the diffusion model. This module uses a learnable interpolation matrix to generate intermediate latent features, thereby augmenting visual supervision with richer mid-state details. Integrating this enriched supervision into the model enables end-to-end training tailored to task-specific requirements and significantly enhances the model's capacity to predict temporally coherent action sequences. Additionally, we introduce an action-aware mask projection mechanism that restricts the action generation space, combined with a task-adaptive masked proximity loss that prioritizes accurate reasoning results close to the given start and end states over those in intermediate steps. At the same time, this design filters out task-irrelevant action predictions, leading to contextually aware action sequences. Experimental results on three widely used benchmark datasets demonstrate that MTID achieves promising action planning performance on most metrics. The code is available at https://github.com/WiserZhou/MTID.
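
The two mechanisms named in the abstract can be illustrated with a toy sketch. It is not the MTID implementation (see the repository linked above for that), and it makes several assumptions: the function names `interpolate_latents` and `mask_action_scores` are hypothetical, fixed logits stand in for the learnable interpolation matrix, each intermediate latent is taken to be a convex blend of the start and end latents, and the mask is modeled as setting disallowed action scores to negative infinity.

```python
import math


def sigmoid(x):
    """Squash a real-valued logit into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def interpolate_latents(start, end, logits):
    """Blend start and end latents into one intermediate latent per logit.

    Assumption: each logit plays the role of one learnable interpolation
    parameter. In a trained model these values would be optimized; here
    they are fixed numbers chosen for illustration.
    """
    if len(start) != len(end):
        raise ValueError("start and end must have the same dimension")
    weights = [sigmoid(logit) for logit in logits]
    return [
        [(1.0 - w) * s + w * e for s, e in zip(start, end)]
        for w in weights
    ]


def mask_action_scores(scores, allowed):
    """Keep scores of allowed action indices; set all others to -inf.

    Assumption: this is one simple way to restrict an action space to
    task-relevant actions, not necessarily the paper's mask projection.
    """
    return [s if i in allowed else float("-inf") for i, s in enumerate(scores)]


# Toy 2-dimensional latents for the start and end observations.
start = [0.0, 2.0]
end = [4.0, 0.0]

# Three hypothetical interpolation logits, giving three intermediate latents.
logits = [-1.0, 0.0, 1.0]
mids = interpolate_latents(start, end, logits)

# Hypothetical scores over three actions; only actions 0 and 2 fit the task.
masked = mask_action_scores([0.1, 0.9, 0.3], {0, 2})
```

With a logit of 0 the blend weight is exactly 0.5, so the middle latent is the midpoint of the start and end latents. Larger logits move the intermediate latent toward the end observation, smaller ones toward the start.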

Yufan Zhou, Zhaobo Qi, Lingshuai Lin, Junqi Jing, Tingting Chai, Beichen Zhang, Shuhui Wang, Weigang Zhang

Subject: Computing technology, computer technology

Yufan Zhou, Zhaobo Qi, Lingshuai Lin, Junqi Jing, Tingting Chai, Beichen Zhang, Shuhui Wang, Weigang Zhang. Masked Temporal Interpolation Diffusion for Procedure Planning in Instructional Videos [EB/OL]. (2025-07-04) [2025-07-25]. https://arxiv.org/abs/2507.03393.