Video-STAR: Reinforcing Open-Vocabulary Action Recognition with Tools

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: multimodal reasoning, vision-language model, action recognition
Abstract:

Multimodal large language models (MLLMs) have demonstrated remarkable potential in bridging visual and textual reasoning, yet their reliance on text-centric priors often limits their ability to disentangle semantically similar actions in open-vocabulary scenarios. To address this, we propose Video-STAR, a framework that harmonizes contextual sub-motion decomposition with tool-augmented reinforcement learning for open-vocabulary action recognition (OVAR). Unlike prior methods that treat actions as monolithic entities, our approach decomposes actions into discriminative sub-motions for fine-grained matching while dynamically invoking domain-specific tools for cross-modal interleaving, thereby enabling category-specific reasoning and reducing cross-modal hallucination. Moreover, by designing a hierarchical reward that balances tool-usage efficiency, sub-motion relevance, and structural coherence in reasoning, our method autonomously leverages external tools to prioritize sub-motion patterns without explicit supervision, shifting from text-centric reasoning to visually grounded inference. Extensive evaluations on HMDB-51, UCF-101, SSv2, Kinetics-400, and Kinetics-600 demonstrate state-of-the-art performance: our method outperforms existing approaches in distinguishing fine-grained actions and mitigating cross-modal hallucination while maintaining computational efficiency.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

Video-STAR proposes a framework that decomposes actions into sub-motions and employs tool-augmented reinforcement learning for open-vocabulary action recognition. It resides in the 'Tool-Augmented and Multimodal Reasoning' leaf, which contains only two papers total. This leaf sits within the broader 'Core Open-Vocabulary Recognition Frameworks' branch, indicating the paper targets a relatively sparse research direction focused on integrating external tools and fine-grained reasoning rather than standard vision-language adaptation.

The taxonomy reveals neighboring leaves such as 'Vision-Language Model Adaptation' (four papers) and 'Prompt-Based and Semantic Enhancement' (four papers), both emphasizing direct CLIP adaptation or prompt optimization. Video-STAR diverges by introducing tool invocation and sub-motion decomposition, moving beyond static prompt engineering. The 'Robustness and Debiasing' leaf (three papers) addresses distributional shifts, while Video-STAR's hierarchical reward mechanism targets cross-modal hallucination and reasoning coherence, suggesting a complementary but distinct focus on structured inference rather than debiasing alone.

Among eleven candidates examined, none clearly refute the three core contributions. The framework integration (one candidate examined) and multimodal tool library (ten candidates examined) both show zero refutable overlaps. The hierarchical reward mechanism was not directly compared against prior work in the search. This limited scope—eleven candidates from semantic search—means the analysis captures immediate neighbors but cannot confirm exhaustive novelty. The absence of refutations suggests the specific combination of sub-motion decomposition, tool-augmented RL, and hierarchical rewards is not prominently represented in the examined literature.

Based on top-eleven semantic matches, Video-STAR appears to occupy a distinct niche within tool-augmented reasoning for action recognition. The sparse population of its taxonomy leaf and lack of direct prior work overlap indicate potential novelty, though the limited search scope precludes definitive claims. A broader literature review would be needed to assess whether similar tool-based or sub-motion strategies exist outside the examined candidate set.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 11
Refutable Papers: 0

Research Landscape Overview

Core task: open-vocabulary action recognition. The field has evolved to recognize actions beyond fixed label sets, organizing itself into several major branches. Core Open-Vocabulary Recognition Frameworks explore foundational methods that leverage vision-language models and multimodal reasoning to generalize across diverse action vocabularies, often employing tool-augmented strategies or iterative prompting techniques such as Iterative Visual Prompting[5] and Dynamic Frame Selection[35]. Open-Vocabulary Action Detection and Localization extends these ideas to spatio-temporal settings, addressing when and where actions occur in video, as seen in works like Spatio-Temporal Action Detection[2] and Scaling Action Detection[1]. Domain-Specific Open-Vocabulary Recognition targets specialized contexts—egocentric videos, hand actions, or skeleton-based modalities—while Open-Set Action Recognition focuses on rejecting unknown classes and handling distributional shifts. Finally, Evaluation, Benchmarking, and Related Tasks provide datasets and metrics to assess generalization and cross-domain robustness.

Recent lines of work reveal contrasting emphases: some studies prioritize scaling and weakly-supervised pretraining to broaden action vocabularies, while others refine prompting and reasoning mechanisms to improve zero-shot transfer. Video-STAR[0] sits within the Tool-Augmented and Multimodal Reasoning cluster, emphasizing structured reasoning over raw video features. Compared to Dynamic Frame Selection[35], which optimizes frame sampling for efficiency, Video-STAR[0] integrates external knowledge and tool-based inference to handle complex action semantics. This approach aligns with a growing interest in combining vision-language backbones with auxiliary reasoning modules, contrasting with purely end-to-end adaptation methods like Action-Conditioned Prompts[3].
The central trade-off remains between leveraging large-scale pretraining for broad coverage and designing specialized reasoning pathways for nuanced, context-dependent action understanding.

Claimed Contributions

Video-STAR framework integrating sub-motion decomposition with tool-augmented reinforcement learning

The authors introduce Video-STAR, a unified framework that combines contextual sub-motion decomposition with tool-augmented reinforcement learning. This approach decomposes actions into discriminative sub-motion primitives for fine-grained matching while dynamically invoking domain-specific tools, enabling category-specific reasoning and reducing cross-modal hallucination in open-vocabulary action recognition.

1 retrieved paper
Multimodal tool library for dynamic cross-modal reasoning

The authors design a multimodal tool library that integrates pose estimation, human detection, and online retrieval capabilities. This library dynamically augments the reasoning process with domain-specific knowledge to resolve cross-modal hallucinations during inference.

10 retrieved papers
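The tool library described above can be pictured as a registry that dispatches reasoning queries to domain-specific backends. The sketch below is illustrative only: the class name, tool names, and interfaces are hypothetical stand-ins (the paper does not specify its actual API), and each tool is stubbed rather than wired to a real pose-estimation, detection, or retrieval model.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class ToolResult:
    tool: str
    evidence: str


class MultimodalToolLibrary:
    """Registry that dispatches reasoning queries to domain-specific tools."""

    def __init__(self) -> None:
        self._tools: Dict[str, Callable[[str], ToolResult]] = {}

    def register(self, name: str, fn: Callable[[str], ToolResult]) -> None:
        self._tools[name] = fn

    def invoke(self, name: str, query: str) -> ToolResult:
        if name not in self._tools:
            raise KeyError(f"unknown tool: {name}")
        return self._tools[name](query)


# Stubs standing in for real pose-estimation / detection / retrieval models.
def pose_estimator(query: str) -> ToolResult:
    return ToolResult("pose", f"keypoints for '{query}'")


def human_detector(query: str) -> ToolResult:
    return ToolResult("detect", f"person boxes for '{query}'")


def web_retriever(query: str) -> ToolResult:
    return ToolResult("retrieve", f"snippets about '{query}'")


library = MultimodalToolLibrary()
library.register("pose", pose_estimator)
library.register("detect", human_detector)
library.register("retrieve", web_retriever)

result = library.invoke("pose", "cartwheel")
print(result.evidence)
```

In such a design, the policy would emit a tool name and query during reasoning, and the returned evidence would be interleaved back into the cross-modal context.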
Hierarchical reward mechanism for reinforcement learning

The authors propose a hierarchical reward function that balances tool-usage efficiency, sub-motion relevance, and structural coherence. This mechanism ensures tools are activated only when meaningful, while sub-motion hierarchies are weighted to prioritize semantically salient components, enabling the model to autonomously leverage external tools without explicit supervision.

0 retrieved papers
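As a rough illustration of how the three terms of such a hierarchical reward might be combined, the following sketch computes a weighted sum of tool-usage efficiency, weighted sub-motion relevance, and a coherence bonus. The function name, the coefficients alpha, beta, and gamma, and the exact definition of each term are assumptions made for this sketch, not the paper's actual formulation.

```python
def hierarchical_reward(
    tool_calls: int,
    useful_tool_calls: int,
    submotion_scores: list,
    submotion_weights: list,
    coherent: bool,
    alpha: float = 0.4,
    beta: float = 0.4,
    gamma: float = 0.2,
) -> float:
    """Weighted sum of tool-efficiency, sub-motion relevance, and coherence."""
    # Tool-usage efficiency: fraction of tool calls that yielded useful
    # evidence; a trace with no tool calls is not penalized.
    efficiency = useful_tool_calls / tool_calls if tool_calls else 1.0

    # Sub-motion relevance: weighted average of per-sub-motion match scores,
    # so semantically salient sub-motions dominate the term.
    total_w = sum(submotion_weights)
    relevance = (
        sum(w * s for w, s in zip(submotion_weights, submotion_scores)) / total_w
        if total_w
        else 0.0
    )

    # Structural coherence: binary bonus for a well-formed reasoning trace.
    coherence = 1.0 if coherent else 0.0

    return alpha * efficiency + beta * relevance + gamma * coherence


r = hierarchical_reward(
    tool_calls=2,
    useful_tool_calls=2,
    submotion_scores=[0.9, 0.6],
    submotion_weights=[0.7, 0.3],
    coherent=True,
)
print(round(r, 3))  # 0.4*1.0 + 0.4*0.81 + 0.2*1.0 = 0.924
```

Gating the efficiency term on useful calls is one way to realize the claim that tools are activated only when meaningful: superfluous invocations lower the reward rather than raise it.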

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Video-STAR framework integrating sub-motion decomposition with tool-augmented reinforcement learning

Contribution

Multimodal tool library for dynamic cross-modal reasoning

Contribution

Hierarchical reward mechanism for reinforcement learning
