OmniSTVG: Toward Spatio-Temporal Omni-Object Video Grounding

ICLR 2026 Conference Submission, Anonymous Authors
Keywords: Spatio-Temporal Video Grounding, Spatio-Temporal Omni-Object Video Grounding, Benchmark
Abstract:

We introduce spatio-temporal omni-object video grounding, dubbed OmniSTVG, a new STVG task aiming to localize, spatially and temporally, all targets mentioned in the textual query within videos. Compared to classic STVG, which locates only a single target, OmniSTVG enables localization of not only an arbitrary number of text-referred targets but also their interacting counterparts mentioned in the query, making it more flexible and practical in real scenarios for comprehensive understanding. To facilitate exploration of OmniSTVG, we propose BOSTVG, a large-scale benchmark dedicated to OmniSTVG. Specifically, BOSTVG contains 10,018 videos with 10.2M frames and covers a wide selection of 287 classes from diverse scenarios. Each sequence, paired with a free-form textual query, encompasses a varying number of targets ranging from 1 to 10. To ensure high quality, each video is manually annotated with meticulous inspection and refinement. To the best of our knowledge, BOSTVG is, to date, the first and largest benchmark for OmniSTVG. To encourage future research, we present a simple yet effective approach, named OmniTube, which, drawing inspiration from Transformer-based STVG methods, is specially designed for OmniSTVG and demonstrates promising results. By releasing BOSTVG, we hope to go beyond classic STVG by locating every object appearing in the query for more comprehensive understanding, opening up a new direction for STVG. Our benchmark and code will be released.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces OmniSTVG, a task for localizing all text-mentioned targets and their interacting counterparts in videos with spatio-temporal tubes. It resides in the Multi-Object and Omni-Object Grounding leaf, which contains four papers including this one. This leaf sits within the broader Spatio-Temporal Video Grounding and Localization branch, distinguishing itself from Single-Object Grounding (three papers) and Action-Centric Grounding (two papers) siblings. The relatively small cluster size suggests this is an emerging research direction rather than a saturated area, though the parent branch encompasses multiple active subcategories addressing temporal moment localization and action-centric methods.

The taxonomy reveals neighboring work in closely related directions. Single-Object Grounding methods focus on localizing one primary target per query, while Action-Centric Grounding emphasizes events and their interacting objects. Temporal Moment Localization (five papers) addresses temporal boundaries without spatial boxes, representing a complementary problem formulation. The broader Video-Language Models branch (six papers across three leaves) explores foundation models with grounding capabilities, and Weakly-Supervised Localization (seven papers) reduces annotation requirements. OmniSTVG's emphasis on localizing all query-mentioned objects plus their interacting counterparts positions it at the intersection of multi-object tracking and comprehensive language-driven understanding, diverging from single-target or action-only paradigms.

Of the thirty candidates examined (ten per contribution), the OmniSTVG task contribution drew one refutable candidate, suggesting that some prior work addresses similar multi-object grounding formulations within the limited search scope. The BOSTVG benchmark contribution drew zero refutations across its ten candidates, indicating that the dataset's scale and annotation protocol may offer distinct value. The OmniTube method contribution was likewise unrefuted across its ten candidates, though this does not preclude related architectural approaches in the broader literature. These statistics reflect a focused semantic search rather than exhaustive coverage, meaning additional relevant work may exist beyond the top-thirty matches analyzed here.

Based on the limited search scope of thirty semantically similar papers, the work appears to advance multi-object grounding with a comprehensive benchmark and method. The task formulation shows some overlap with prior efforts, while the dataset and approach contributions exhibit less direct precedent among examined candidates. The taxonomy context suggests this research direction remains relatively sparse compared to single-object or temporal-only localization, though the analysis cannot definitively assess novelty beyond the top-K semantic neighborhood explored.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 1

Research Landscape Overview

Core task: spatio-temporal omni-object video grounding seeks to localize and track multiple objects in video sequences based on natural language descriptions, integrating spatial bounding boxes with temporal dynamics.

The field's taxonomy reveals several complementary branches. Spatio-Temporal Video Grounding and Localization focuses on methods that jointly reason about when and where objects appear, often employing transformer-based architectures to align linguistic queries with video regions across frames. Weakly-Supervised and Unsupervised Video Object Localization explores techniques that reduce annotation burden by learning from noisy or incomplete labels. Video-Language Models and Foundation Models leverage large-scale pretraining to build general-purpose representations that transfer across diverse grounding tasks. Meanwhile, Video Object Detection and Tracking emphasizes robust instance-level tracking, and Video Scene Understanding branches into holistic scene graphs and relational reasoning. Multimodal Video Understanding integrates audio, text, and visual cues, while Video Segmentation and Manipulation addresses pixel-level precision and editing. Specialized Video Analysis Tasks cover domain-specific challenges such as egocentric or sketch-based localization, and Spatio-Temporal Modeling Foundations provides core architectural innovations like memory-enhanced detection and temporal attention mechanisms.

Within the Multi-Object and Omni-Object Grounding cluster, recent works tackle the challenge of grounding diverse object categories and multiple instances simultaneously. OmniSTVG[0] exemplifies this direction by proposing a unified framework for omni-object grounding that handles varied object types and complex temporal dependencies. Nearby efforts such as SVAG-Bench[4] introduce comprehensive benchmarks to evaluate grounding across diverse scenarios, while Universal Video Grounding[5] aims for generalization across object classes and query styles. Described Spatial-Temporal Detection[8] emphasizes fine-grained alignment between descriptive language and spatio-temporal tubes, contrasting with earlier memory-based approaches like Memory Enhanced Detection[6] that prioritize long-range temporal consistency. These works collectively highlight trade-offs between model generality, annotation efficiency, and temporal reasoning depth, with OmniSTVG[0] positioned as a holistic solution that bridges multi-object tracking with flexible language-driven localization.

Claimed Contributions

OmniSTVG task for spatio-temporal omni-object video grounding

The authors propose a new task called OmniSTVG that extends classic spatio-temporal video grounding by localizing all objects mentioned in a textual query, including both targets of interest and their interacting counterparts, rather than only a single target.

10 retrieved papers
Can Refute
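Since the OmniSTVG task asks for one spatio-temporal tube per query-mentioned target, its evaluation plausibly extends the standard single-tube vIoU of classic STVG to an average over targets. The sketch below illustrates that idea under this assumption; it is not the paper's official protocol, and the function names (box_iou, viou, mean_viou_multi) and the pairing-by-target-id convention are hypothetical.

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def viou(pred, gt):
    """Single-tube vIoU: pred and gt map frame index -> box.
    Sums box IoU over frames where both tubes exist, then divides
    by the number of frames in the union of the two tubes."""
    frames = set(pred) | set(gt)
    if not frames:
        return 0.0
    overlap = sum(box_iou(pred[f], gt[f]) for f in set(pred) & set(gt))
    return overlap / len(frames)

def mean_viou_multi(preds, gts):
    """Hypothetical multi-target extension: average vIoU over all
    ground-truth targets, with tubes paired by a shared target id.
    A missing prediction scores 0 for that target."""
    if not gts:
        return 0.0
    return sum(viou(preds.get(t, {}), tube) for t, tube in gts.items()) / len(gts)
```

A prediction that covers only half of a target's ground-truth frames (with perfect boxes) would score 0.5 for that target, and the mean over 1 to 10 targets gives a single video-level number.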
BOSTVG benchmark dataset for OmniSTVG

The authors introduce BOSTVG, a large-scale benchmark dataset containing 10,018 videos with over 10 million frames across 287 object categories. Each video is paired with a free-form textual query and manually annotated with spatio-temporal tubes for all mentioned targets.

10 retrieved papers
OmniTube method for OmniSTVG

The authors develop OmniTube, a Transformer-based approach specifically designed for the OmniSTVG task. It uses text-guided query generation and a spatio-temporal decoder to simultaneously localize multiple objects mentioned in textual queries from videos.

10 retrieved papers
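The report only states that OmniTube uses text-guided query generation and a spatio-temporal decoder. In Transformer-based set-prediction grounders of this kind, one key step is matching decoder queries to a variable number of ground-truth targets. The sketch below shows that step in minimal form, assuming a precomputed query-target cost matrix; it is a brute-force stand-in for the usual Hungarian matching, not OmniTube's actual implementation, and match_queries is a hypothetical name.

```python
from itertools import permutations

def match_queries(cost):
    """Assign one decoder query to each ground-truth target so the
    total cost is minimal. cost[q][t] is the (precomputed) cost of
    pairing query q with target t; there must be at least as many
    queries as targets. Brute force is viable for the 1-10 targets
    reported for BOSTVG; production code would use the Hungarian
    algorithm (e.g. scipy.optimize.linear_sum_assignment).
    Returns {query_index: target_index}."""
    n_q, n_t = len(cost), len(cost[0])
    best_total, best_perm = float("inf"), None
    for perm in permutations(range(n_q), n_t):
        total = sum(cost[q][t] for t, q in enumerate(perm))
        if total < best_total:
            best_total, best_perm = total, perm
    return {q: t for t, q in enumerate(best_perm)}
```

In practice the cost would combine a box-overlap term and a text-alignment term per query-target pair; unmatched queries are treated as predicting "no object".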

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution 1: OmniSTVG task for spatio-temporal omni-object video grounding

Contribution 2: BOSTVG benchmark dataset for OmniSTVG

Contribution 3: OmniTube method for OmniSTVG

The full descriptions of these contributions are given above under Claimed Contributions.