TAR: Temporal Anchor-Constrained Reasoning for Video Temporal Grounding

Year: 2025
arxiv.org/abs/2508.07683 (CC-BY-4.0)

Abstract

Video Temporal Grounding (VTG) aims to localize the video segments that correspond to natural language queries. While recent Large Vision-Language Models (LVLMs) employ Reinforcement Learning to generate Chains-of-Thought (CoT), they typically rely solely on outcome-based supervision. This often leads to hallucinations, where the reasoning process becomes disconnected from both the visual content and the final prediction. Existing attempts to mitigate this rely on external supervision from larger models or separate reward models, which is computationally expensive and prone to rigid patterns. To address these challenges, we propose TAR (Temporal Anchor-Constrained Reasoning), a framework that introduces the temporal anchor (T-anchor) as a transparent and auditable checkpoint mechanism. T-anchor enforces progressive refinement within the CoT, compelling the model to continuously ground its intermediate thoughts in visual evidence and iteratively calibrate its temporal predictions. This significantly improves the faithfulness and autonomy of the reasoning process as well as the final accuracy. Furthermore, we introduce a bootstrapping paradigm that automatically harvests high-quality CoT data using only a standard 7B model, eliminating the dependency on ultra-large models. Extensive experiments demonstrate that TAR achieves state-of-the-art performance and generates faithful, autonomous, and progressively refined reasoning traces.