TimeSearch-R: Adaptive Temporal Search for Long-Form Video Understanding via Self-Verification Reinforcement Learning

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: temporal search, long video understanding, reinforcement learning, large video language model
Abstract:

Temporal search aims to identify a minimal set of relevant frames from tens of thousands based on a given query, serving as a foundation for accurate long-form video understanding. Many existing works attempt to progressively narrow the search space. However, these approaches typically rely on a hand-crafted search process, lacking end-to-end optimization for learning optimal search strategies. In this paper, we propose TimeSearch-R, which reformulates temporal search as interleaved text–video thinking, seamlessly integrating searching video clips into the reasoning process through reinforcement learning (RL). However, applying RL training methods, such as Group Relative Policy Optimization (GRPO), to video reasoning can result in unsupervised intermediate search decisions. This leads to insufficient exploration of the video content and inconsistent logical reasoning. To address these issues, we introduce GRPO with Completeness Self-Verification (GRPO-CSV), which gathers searched video frames from the interleaved reasoning process and utilizes the same policy model to verify the adequacy of searched frames, thereby improving the completeness of video reasoning. Additionally, we construct datasets specifically designed for the SFT cold-start and RL training of GRPO-CSV, filtering out samples with weak temporal dependencies to enhance task difficulty and improve temporal search capabilities. Extensive experiments demonstrate that TimeSearch-R achieves substantial improvements on temporal search benchmarks such as Haystack-LVBench and Haystack-Ego4D, long-form video understanding benchmarks like VideoMME, MLVU, and LongVideoBench, as well as video reasoning benchmarks such as Video-Holmes, consistently and significantly outperforming other existing temporal search approaches and text-only reasoning models. All the code, models, and data will be released soon.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes TimeSearch-R, which reformulates temporal search as interleaved text-video reasoning optimized via reinforcement learning, and introduces GRPO-CSV to verify search completeness. It resides in the Query-Driven Temporal Search leaf, which contains only three papers including the original work. This leaf sits within the broader Temporal Search and Frame Selection Methods branch, indicating a relatively focused research direction. The small sibling count suggests this specific formulation—query-driven retrieval with RL-based optimization—occupies a less crowded niche compared to adjacent areas like Adaptive Frame Sampling or Agent-Based Systems.

The taxonomy reveals neighboring work in Adaptive Frame Sampling and Keyframe Selection, which emphasizes content saliency over query-response mechanisms, and Agent-Based Systems, where tools like VideoAgent and Vgent employ iterative multi-step reasoning. The Query-Driven leaf explicitly excludes methods without query-response mechanisms, positioning TimeSearch-R closer to retrieval-augmented approaches than to curiosity-driven exploration. The Training Strategies branch includes Preference Optimization and Reinforcement Learning, housing one paper on Temporal Preference Optimization, suggesting the RL-based training angle connects to emerging optimization trends but remains underexplored in the temporal search context.

Among the nine candidates examined, three appear to refute the first contribution (the TimeSearch-R framework), while the GRPO-CSV algorithm and dataset-construction contributions have no refuting candidates within this limited search. The framework contribution's overlap with prior work likely stems from existing query-driven retrieval methods such as Rethinking Temporal Search and T-Star, which also perform adaptive frame selection. The GRPO-CSV algorithm, for which no candidates were examined, may represent a more novel methodological angle, though this reflects the limited search scope rather than exhaustive coverage. The dataset contribution similarly lacks examined candidates, leaving its novelty less constrained by the available evidence.

Based on the top-nine semantic matches, the framework contribution appears to build incrementally on established query-driven retrieval paradigms, while the algorithmic and dataset contributions remain less scrutinized within this limited scope. The taxonomy structure confirms that query-driven temporal search is a defined but sparsely populated area, with only two sibling papers. This analysis captures what the search reveals but does not preclude additional relevant work outside the examined candidate pool or in adjacent taxonomy branches.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 9
Refutable Papers: 3

Research Landscape Overview

Core task: temporal search for long-form video understanding. The field addresses the challenge of efficiently processing extended video sequences by developing methods that selectively identify and retrieve relevant temporal segments rather than exhaustively encoding every frame. The taxonomy reveals a diverse landscape organized around several complementary themes. Temporal Search and Frame Selection Methods focus on query-driven and adaptive sampling strategies that pinpoint informative frames, while Token and Memory Efficiency Mechanisms tackle the computational bottleneck through compression and merging techniques such as Video Token Merging[2]. Agent-Based and Reasoning Systems introduce iterative, goal-oriented exploration of video content, and Temporal Modeling and Representation Learning emphasizes learning robust temporal features. Training Strategies and Optimization refine model behavior through techniques like Temporal Preference Optimization[5], and Benchmarks and Evaluation Datasets such as MLVU Benchmark[4] and Egoschema[12] provide standardized testbeds. Additional branches cover Multimodal and Spatial-Temporal Understanding, Efficient Model Architectures and Scaling, and Fine-Grained Temporal Reasoning, collectively spanning the spectrum from low-level efficiency to high-level semantic interpretation.

Within this ecosystem, a particularly active line of work explores query-driven temporal search, where models dynamically select frames based on question or task context. TimeSearch-R[0] exemplifies this approach by leveraging retrieval mechanisms to locate relevant temporal windows in long videos, closely aligning with efforts like Rethinking Temporal Search[1] and T-Star[25], which similarly emphasize adaptive, question-aware frame selection. These methods contrast with more static sampling strategies and agent-based systems such as VideoAgent[22] or Vgent[28], which iteratively refine their search through multi-step reasoning.
A key trade-off emerges between the computational overhead of dynamic search and the risk of missing critical context with fixed sampling. TimeSearch-R[0] sits squarely in the query-driven cluster, sharing conceptual ground with T-Star[25] in prioritizing relevance-based retrieval, yet differing in the specifics of how temporal cues guide frame selection. Open questions remain around balancing search efficiency with comprehensive coverage, especially as video lengths and task complexity continue to grow.

Claimed Contributions

TimeSearch-R framework for adaptive temporal search via reinforcement learning

The authors introduce TimeSearch-R, a framework that reformulates temporal search as an interleaved text-video thinking process. This approach enables the model to learn optimal search strategies directly from data through end-to-end reinforcement learning, rather than relying on hand-crafted workflows.

9 retrieved papers (can refute)
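As a rough illustration of the interleaved text-video thinking described above, the sketch below alternates policy generation with time-range frame retrieval until an answer is produced. All names here (`interleaved_search`, `ToyPolicy`, `ToyIndex`, the `<search>`/`<answer>` tag format) are hypothetical stand-ins, not the authors' actual interface.

```python
# Minimal sketch of interleaved text-video thinking: the policy alternates
# free-form reasoning with <search>start, end</search> actions, and searched
# frames re-enter the context before the next generation step.
import re

SEARCH_RE = re.compile(r"<search>([\d.]+),\s*([\d.]+)</search>")

def interleaved_search(policy, video_index, question, max_turns=8):
    """Run the reason-search loop; return (answer, frames gathered as evidence)."""
    context = [question]
    gathered = []
    for _ in range(max_turns):
        step = policy.generate(context)       # text, possibly with a tag
        context.append(step)
        if "<answer>" in step:
            return step, gathered             # final answer plus evidence
        m = SEARCH_RE.search(step)
        if m:
            start, end = float(m.group(1)), float(m.group(2))
            frames = video_index.frames_between(start, end)
            gathered.extend(frames)
            context.append(frames)            # visual evidence enters context
    return None, gathered

# --- toy stand-ins so the loop is runnable ---
class ToyIndex:
    def frames_between(self, start, end):
        return list(range(int(start), int(end), 10))  # one frame per 10 s

class ToyPolicy:
    def __init__(self):
        self.turn = 0
    def generate(self, context):
        self.turn += 1
        if self.turn == 1:
            return "The event is likely late. <search>100, 140</search>"
        return "<answer>B</answer>"

answer, frames = interleaved_search(ToyPolicy(), ToyIndex(), "When does X happen?")
```

The key design point the paper emphasizes is that this loop is not hand-crafted at inference time but learned end-to-end: which spans to query, and when to stop, are policy decisions optimized by RL.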
GRPO with Completeness Self-Verification (GRPO-CSV) algorithm

The authors propose GRPO-CSV, a novel reinforcement learning algorithm that addresses insufficient temporal exploration and inconsistent logical reasoning. It supervises intermediate search decisions by verifying the adequacy of searched frames using the same policy model, ensuring completeness of video reasoning.

0 retrieved papers
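A hedged sketch of how a completeness self-verification (CSV) term could combine with GRPO's group-normalized advantage is shown below. The verification prompt, the reward weighting `w_csv`, and the `generate` interface are illustrative assumptions, not the authors' exact formulation; the `grpo_advantages` part is the standard group-relative normalization.

```python
# Sketch: outcome reward plus a self-verified completeness bonus, fed into
# GRPO's group-normalized advantage. Weights and prompts are assumptions.
import statistics

def csv_reward(policy, question, gathered_frames, answer, gold, w_csv=0.5):
    """Answer correctness plus a bonus when the same policy model judges
    the searched frames sufficient to answer the question."""
    r_answer = 1.0 if answer == gold else 0.0
    verdict = policy.generate([question, gathered_frames,
                               "Are these frames sufficient to answer? yes/no"])
    r_csv = 1.0 if verdict.strip().lower().startswith("yes") else 0.0
    return r_answer + w_csv * r_csv

def grpo_advantages(group_rewards):
    """Standard GRPO step: each rollout's reward normalized within its group."""
    mu = statistics.mean(group_rewards)
    sigma = statistics.pstdev(group_rewards) or 1.0
    return [(r - mu) / sigma for r in group_rewards]

# Toy policy that always judges the evidence sufficient.
class YesPolicy:
    def generate(self, context):
        return "yes"

rewards = [
    csv_reward(YesPolicy(), "q", [3, 7], answer="A", gold="A"),  # correct
    csv_reward(YesPolicy(), "q", [3], answer="B", gold="A"),     # wrong
]
advs = grpo_advantages(rewards)
```

Because the verifier is the policy itself, the CSV term supervises intermediate search decisions without requiring an external reward model.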
High-quality video reasoning dataset construction via two-stage filtering

The authors construct a high-quality video reasoning dataset through a two-stage filtering pipeline. This dataset removes trivial samples solvable through linguistic bias and noisy unsolvable samples, ensuring the model learns correct temporal search processes for GRPO-CSV training.

0 retrieved papers
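The two-stage filtering described above can be sketched as follows. The `answer_blind` and `answer_with_video` callables stand in for a text-only pass and a full-video pass of some answering model; their names and the exact-match criterion are assumptions for illustration.

```python
# Sketch of two-stage dataset filtering: stage 1 drops samples solvable
# through linguistic bias alone; stage 2 drops samples unsolvable even
# with the full video. Interfaces here are hypothetical.
def two_stage_filter(samples, answer_blind, answer_with_video):
    """Keep samples that genuinely require the video and are solvable."""
    kept = []
    for s in samples:
        # Stage 1: drop trivial samples a text-only model already answers.
        if answer_blind(s["question"]) == s["gold"]:
            continue
        # Stage 2: drop noisy samples wrong even with full video access.
        if answer_with_video(s["question"], s["video"]) != s["gold"]:
            continue
        kept.append(s)
    return kept

# Toy demo: one trivial, one good, one noisy sample.
samples = [
    {"question": "q1", "video": "v1", "gold": "A"},  # blind-answerable
    {"question": "q2", "video": "v2", "gold": "B"},  # needs video, solvable
    {"question": "q3", "video": "v3", "gold": "C"},  # unsolvable noise
]
blind = lambda q: {"q1": "A"}.get(q, "?")
with_video = lambda q, v: {"q2": "B"}.get(q, "?")
kept = two_stage_filter(samples, blind, with_video)
```

Filtering out both extremes concentrates training on samples with strong temporal dependencies, which is what makes them useful supervision for the search behavior GRPO-CSV is meant to learn.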

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution
