TimeProVe: Propose, then Verify for Efficient Long Video Temporal Reasoning in Activities of Daily Living
Arkaprava Sinha Dominick Reilly Siddharth Krishnan Hieu Le Srijan Das
作者信息
Abstract
Long Video Question Answering (LVQA) requires identifying sparse, query-relevant evidence within hours-long untrimmed videos. Existing approaches either process videos densely with large vision-language models (VLMs), incurring prohibitive computational cost, or rely on sparse caption-based reasoning, which often misses temporally localized and motion-centric evidence. We introduce TimeProVe, a cost-efficient hybrid framework for temporally grounded reasoning in long videos. TimeProVe first employs lightweight modules to generate action-grounded answer--evidence hypotheses and subsequently invokes an expensive VLM only for targeted verification. The core of our framework lies in the Action-based Candidate Evidence (ACE) module, which converts temporally localized actions into query-conditioned candidate answers and supporting evidence windows through lightweight LLM reasoning. We further introduce OpenTSUBench (OTB), an open-ended benchmark designed to evaluate temporally grounded reasoning in real-world Activities of Daily Living (ADL) scenarios. Experiments show that TimeProVe outperforms the strongest baseline on OTB by 7.3%, while reducing VLM calls by 75% and inference cost by 93%. Furthermore, without explicit temporal grounding training, TimeProVe achieves competitive performance on Charades-STA, and reaches state-of-the-art results when enhanced with grounding VLMs.引用本文复制引用
Arkaprava Sinha,Dominick Reilly,Siddharth Krishnan,Hieu Le,Srijan Das.TimeProVe: Propose, then Verify for Efficient Long Video Temporal Reasoning in Activities of Daily Living[EB/OL].(2026-06-18)[2026-06-21].https://arxiv.org/abs/2606.20561.学科分类
计算技术、计算机技术