OmniSTVG: Toward Spatio-Temporal Omni-Object Video Grounding

ICLR 2026 Conference Submission, Anonymous Authors
Keywords: Spatio-Temporal Video Grounding, Spatio-Temporal Omni-Object Video Grounding, Benchmark
Abstract:

We introduce spatio-temporal omni-object video grounding, dubbed OmniSTVG, a new STVG task aiming to localize, spatially and temporally, all targets mentioned in the textual query within videos. Compared to classic STVG, which locates only a single target, OmniSTVG enables localization of not only an arbitrary number of text-referred targets but also their interacting counterparts mentioned in the query, making it more flexible and practical in real scenarios for comprehensive understanding. To facilitate exploration of OmniSTVG, we propose BOSTVG, a large-scale benchmark dedicated to OmniSTVG. Specifically, BOSTVG contains 10,018 videos with 10.2M frames and covers a wide selection of 287 classes from diverse scenarios. Each sequence, paired with a free-form textual query, encompasses a varying number of targets ranging from 1 to 10. To ensure high quality, each video is manually annotated with meticulous inspection and refinement. To the best of our knowledge, BOSTVG is, to date, the first and largest benchmark for OmniSTVG. To encourage future research, we present a simple yet effective approach, named OmniTube, which, drawing inspiration from Transformer-based STVG methods, is specially designed for OmniSTVG and demonstrates promising results. By releasing BOSTVG, we hope to go beyond classic STVG by locating every object appearing in the query for more comprehensive understanding, opening up a new direction for STVG. Our benchmark and code will be released.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces OmniSTVG, a task for localizing all text-mentioned targets and their interacting counterparts in videos with spatio-temporal tubes. It resides in the Multi-Object and Omni-Object Grounding leaf, which contains four papers including this one. This leaf sits within the broader Spatio-Temporal Video Grounding and Localization branch, distinguishing itself from Single-Object Grounding (three papers) and Action-Centric Grounding (two papers) siblings. The relatively small cluster size suggests this is an emerging research direction rather than a saturated area, though the parent branch encompasses multiple active subcategories addressing temporal moment localization and action-centric methods.

The taxonomy reveals neighboring work in closely related directions. Single-Object Grounding methods focus on localizing one primary target per query, while Action-Centric Grounding emphasizes events and their interacting objects. Temporal Moment Localization (five papers) addresses temporal boundaries without spatial boxes, representing a complementary problem formulation. The broader Video-Language Models branch (six papers across three leaves) explores foundation models with grounding capabilities, and Weakly-Supervised Localization (seven papers) reduces annotation requirements. OmniSTVG's emphasis on localizing all query-mentioned objects plus their interacting counterparts positions it at the intersection of multi-object tracking and comprehensive language-driven understanding, diverging from single-target or action-only paradigms.

Of the thirty candidates examined (ten per contribution), the OmniSTVG task contribution drew one refutable candidate, suggesting that some prior work addresses similar multi-object grounding formulations within the limited search scope. The BOSTVG benchmark contribution drew zero refutations across its ten candidates, indicating that the dataset's scale and annotation protocol may offer distinct value. The OmniTube method contribution was likewise unrefuted across its ten candidates, though this does not preclude related architectural approaches in the broader literature. These statistics reflect a focused semantic search rather than exhaustive coverage, meaning additional relevant work may exist beyond the top-thirty matches analyzed here.

Based on the limited search scope of thirty semantically similar papers, the work appears to advance multi-object grounding with a comprehensive benchmark and method. The task formulation shows some overlap with prior efforts, while the dataset and approach contributions exhibit less direct precedent among examined candidates. The taxonomy context suggests this research direction remains relatively sparse compared to single-object or temporal-only localization, though the analysis cannot definitively assess novelty beyond the top-K semantic neighborhood explored.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 1

Research Landscape Overview

Core task: spatio-temporal omni-object video grounding seeks to localize and track multiple objects in video sequences based on natural language descriptions, integrating spatial bounding boxes with temporal dynamics.

The field's taxonomy reveals several complementary branches. Spatio-Temporal Video Grounding and Localization focuses on methods that jointly reason about when and where objects appear, often employing transformer-based architectures to align linguistic queries with video regions across frames. Weakly-Supervised and Unsupervised Video Object Localization explores techniques that reduce annotation burden by learning from noisy or incomplete labels. Video-Language Models and Foundation Models leverage large-scale pretraining to build general-purpose representations that transfer across diverse grounding tasks. Meanwhile, Video Object Detection and Tracking emphasizes robust instance-level tracking, and Video Scene Understanding branches into holistic scene graphs and relational reasoning. Multimodal Video Understanding integrates audio, text, and visual cues, while Video Segmentation and Manipulation addresses pixel-level precision and editing. Specialized Video Analysis Tasks cover domain-specific challenges such as egocentric or sketch-based localization, and Spatio-Temporal Modeling Foundations provides core architectural innovations like memory-enhanced detection and temporal attention mechanisms.

Within the Multi-Object and Omni-Object Grounding cluster, recent works tackle the challenge of grounding diverse object categories and multiple instances simultaneously. OmniSTVG[0] exemplifies this direction by proposing a unified framework for omni-object grounding that handles varied object types and complex temporal dependencies. Nearby efforts such as SVAG-Bench[4] introduce comprehensive benchmarks to evaluate grounding across diverse scenarios, while Universal Video Grounding[5] aims for generalization across object classes and query styles. Described Spatial-Temporal Detection[8] emphasizes fine-grained alignment between descriptive language and spatio-temporal tubes, contrasting with earlier memory-based approaches like Memory Enhanced Detection[6] that prioritize long-range temporal consistency. These works collectively highlight trade-offs between model generality, annotation efficiency, and temporal reasoning depth, with OmniSTVG[0] positioned as a holistic solution that bridges multi-object tracking with flexible language-driven localization.

Claimed Contributions

OmniSTVG task for spatio-temporal omni-object video grounding

The authors propose a new task called OmniSTVG that extends classic spatio-temporal video grounding by localizing all objects mentioned in a textual query, including both targets of interest and their interacting counterparts, rather than only a single target.

10 retrieved papers
Can Refute
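Since the OmniSTVG task asks for one spatio-temporal tube per query-mentioned target, its evaluation plausibly extends the standard single-tube vIoU of classic STVG to an average over targets. The sketch below illustrates that idea under this assumption; it is not the paper's official protocol, and the function names (box_iou, viou, mean_viou_multi) and the pairing-by-target-id convention are hypothetical.

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def viou(pred, gt):
    """Single-tube vIoU: pred and gt map frame index -> box.
    Sums box IoU over frames where both tubes exist, then divides
    by the number of frames in the union of the two tubes."""
    frames = set(pred) | set(gt)
    if not frames:
        return 0.0
    overlap = sum(box_iou(pred[f], gt[f]) for f in set(pred) & set(gt))
    return overlap / len(frames)

def mean_viou_multi(preds, gts):
    """Hypothetical multi-target extension: average vIoU over all
    ground-truth targets, with tubes paired by a shared target id.
    A missing prediction scores 0 for that target."""
    if not gts:
        return 0.0
    return sum(viou(preds.get(t, {}), tube) for t, tube in gts.items()) / len(gts)
```

A prediction that covers only half of a target's ground-truth frames (with perfect boxes) would score 0.5 for that target, and the mean over 1 to 10 targets gives a single video-level number.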
BOSTVG benchmark dataset for OmniSTVG

The authors introduce BOSTVG, a large-scale benchmark dataset containing 10,018 videos with over 10 million frames across 287 object categories. Each video is paired with a free-form textual query and manually annotated with spatio-temporal tubes for all mentioned targets.

10 retrieved papers
OmniTube method for OmniSTVG

The authors develop OmniTube, a Transformer-based approach specifically designed for the OmniSTVG task. It uses text-guided query generation and a spatio-temporal decoder to simultaneously localize multiple objects mentioned in textual queries from videos.

10 retrieved papers
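The report only states that OmniTube uses text-guided query generation and a spatio-temporal decoder. In Transformer-based set-prediction grounders of this kind, one key step is matching decoder queries to a variable number of ground-truth targets. The sketch below shows that step in minimal form, assuming a precomputed query-target cost matrix; it is a brute-force stand-in for the usual Hungarian matching, not OmniTube's actual implementation, and match_queries is a hypothetical name.

```python
from itertools import permutations

def match_queries(cost):
    """Assign one decoder query to each ground-truth target so the
    total cost is minimal. cost[q][t] is the (precomputed) cost of
    pairing query q with target t; there must be at least as many
    queries as targets. Brute force is viable for the 1-10 targets
    reported for BOSTVG; production code would use the Hungarian
    algorithm (e.g. scipy.optimize.linear_sum_assignment).
    Returns {query_index: target_index}."""
    n_q, n_t = len(cost), len(cost[0])
    best_total, best_perm = float("inf"), None
    for perm in permutations(range(n_q), n_t):
        total = sum(cost[q][t] for t, q in enumerate(perm))
        if total < best_total:
            best_total, best_perm = total, perm
    return {q: t for t, q in enumerate(best_perm)}
```

In practice the cost would combine a box-overlap term and a text-alignment term per query-target pair; unmatched queries are treated as predicting "no object".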

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution 1: OmniSTVG task for spatio-temporal omni-object video grounding

Contribution 2: BOSTVG benchmark dataset for OmniSTVG

Contribution 3: OmniTube method for OmniSTVG

The full descriptions of these contributions are given above under Claimed Contributions.