TimeProVe: Propose, then Verify for Efficient Long Video Temporal Reasoning in Activities of Daily Living

arXiv · PDF · 코드 미공개

초록

Long Video Question Answering (LVQA) requires identifying sparse, query-relevant evidence within hours-long untrimmed videos. Existing approaches either process videos densely with large vision-language models (VLMs), incurring prohibitive computational cost, or rely on sparse caption-based reasoning, which often misses temporally localized and motion-centric evidence. We introduce TimeProVe, a cost-efficient hybrid framework for temporally grounded reasoning in long videos. TimeProVe first employs lightweight modules to generate action-grounded answer--evidence hypotheses and subsequently invokes an expensive VLM only for targeted verification. The core of our framework lies in the Action-based Candidate Evidence (ACE) module, which converts temporally localized actions into query-conditioned candidate answers and supporting evidence windows through lightweight LLM reasoning. We further introduce OpenTSUBench (OTB), an open-ended benchmark designed to evaluate temporally grounded reasoning in real-world Activities of Daily Living (ADL) scenarios. Experiments show that TimeProVe outperforms the strongest baseline on OTB by 7.3%, while reducing VLM calls by 75% and inference cost by 93%. Furthermore, without explicit temporal grounding training, TimeProVe achieves competitive performance on Charades-STA, and reaches state-of-the-art results when enhanced with grounding VLMs.

저자 (5명)

Arkaprava Sinha — LinkedIn 검색
Dominick Reilly — LinkedIn 검색
Siddharth Krishnan — LinkedIn 검색
Hieu Le — LinkedIn 검색
Srijan Das — LinkedIn 검색

저자 LinkedIn 변경 추적은 추후 자동화 예정입니다. 현재는 검색 링크를 제공합니다.