Mini-o3: Scaling Up Reasoning Patterns and Interaction Turns for Visual Search
Overview
Overall Novelty Assessment
The paper introduces Mini-o3, a system for deep multi-turn visual search that uses reinforcement learning to execute reasoning trajectories spanning tens of steps. It resides in the 'Deep Multi-turn Visual Search with RL' leaf, which contains four papers including the paper under review. This leaf sits within the broader 'Reinforcement Learning for Multi-turn Tool-based Reasoning' branch, a moderately populated research direction focused on RL-driven tool orchestration. The taxonomy reveals an active but not overcrowded area, with sibling papers such as MMSearch-R1 and DeepMMSearch-R1 pursuing similar extended-reasoning paradigms.
The taxonomy positions this work at the intersection of RL-based tool-use and multi-turn reasoning, distinct from supervised frameworks (e.g., LLaVA-Plus, Beyond Seeing) that rely on curated demonstrations rather than trial-and-error learning. Neighboring leaves include 'Agentic Tool-use with Visual Reasoning via RL' and 'Multi-step RL for Reasoning and Tool Integration', which explore synthetic data generation and reward shaping. The scope note for the original leaf explicitly excludes single-turn or shallow reasoning systems, clarifying that Mini-o3's contribution lies in scaling interaction depth rather than breadth of tool types or domains.
Among the 23 candidates examined, none clearly refutes the three core contributions. For the Visual Probe Dataset, 10 candidates were examined with zero refutable overlaps, suggesting novelty in constructing challenging visual search problems for exploratory reasoning. For the iterative data collection pipeline, 10 candidates were likewise examined without refutation, indicating that the approach to generating diverse cold-start trajectories (depth-first search, trial-and-error, goal maintenance) appears distinct within the limited search scope. For the over-turn masking strategy, only 3 candidates were examined, reflecting a more specialized technical contribution with no prior work identified in the sample.
Based on the limited search of 23 semantically similar papers, the work appears to introduce novel components within its specific niche of deep multi-turn visual search. The taxonomy context shows a moderately active research area with clear boundaries separating RL-based from supervised approaches. However, this analysis is not an exhaustive literature review and does not cover adjacent fields beyond the top-K semantic matches, leaving open the possibility of related work in broader RL or visual reasoning domains not captured here.
Taxonomy
Research Landscape Overview
Claimed Contributions
The authors introduce a new dataset containing 4,000 training and 500 test visual question-answer pairs across three difficulty levels. The dataset features small targets, numerous distractor objects, and high-resolution images, deliberately constructed so that answering requires iterative exploration and trial-and-error reasoning.
The authors propose a pipeline that uses in-context learning with manually crafted exemplars to generate approximately 6,000 multi-turn trajectories. These trajectories demonstrate varied reasoning strategies such as depth-first search, self-reflection, and goal maintenance for supervised fine-tuning initialization.
The authors introduce a masking technique that avoids penalizing trajectories exceeding the training-time turn limit by masking their advantages during policy updates. This enables test-time scaling where models trained with only 6 turns can naturally extend to tens of turns at inference while maintaining training efficiency.
Core Task Comparisons
Comparisons with papers in the same taxonomy category
[1] MMSearch-R1: Incentivizing LMMs to Search PDF
[16] DeepMMSearch-R1: Empowering Multimodal LLMs in Multimodal Web Search PDF
[27] SenseNova-MARS: Empowering Multimodal Agentic Reasoning and Search via Reinforcement Learning PDF
Contribution Analysis
Detailed comparisons for each claimed contribution
Visual Probe Dataset for challenging visual search
The authors introduce a new dataset containing 4,000 training and 500 test visual question-answer pairs across three difficulty levels. The dataset features small targets, numerous distractor objects, and high-resolution images, deliberately constructed so that answering requires iterative exploration and trial-and-error reasoning.
[33] Towards large-scale small object detection: Survey and benchmarks PDF
[34] Visible-thermal tiny object detection: A benchmark dataset and baselines PDF
[35] Detecting tiny objects in aerial images: A normalized Wasserstein distance and a new benchmark PDF
[36] PD components and distractor inhibition in visual search: New evidence for the signal suppression hypothesis. PDF
[37] Set-size effects in visual search: The effect of attention is independent of the stimulus for simple tasks PDF
[38] The effect of target salience and size in visual search within naturalistic scenes under degraded vision PDF
[39] Tracking small and fast moving objects: A benchmark PDF
[40] Target grouping in visual search for multiple digits PDF
[41] The role of categorization in visual search for orientation. PDF
[42] The Impact of Perceptual Load and Distractors' Perceptual Grouping on Visual Search in ASD PDF
Iterative data collection pipeline for diverse cold-start trajectories
The authors propose a pipeline that uses in-context learning with manually crafted exemplars to generate approximately 6,000 multi-turn trajectories. These trajectories demonstrate varied reasoning strategies such as depth-first search, self-reflection, and goal maintenance for supervised fine-tuning initialization.
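As a rough sketch, the collection loop described above might look like the following. The exemplar texts, function names, and the keep-only-correct-answers filter are illustrative assumptions rather than the paper's exact procedure, and the VLM call is abstracted as a `generate` callable.

```python
import random
from typing import Callable, Dict, List

# One hand-written exemplar per reasoning pattern the cold-start
# trajectories should demonstrate (placeholder texts, not the paper's).
EXEMPLARS: Dict[str, str] = {
    "depth_first_search": "<exemplar: zoom fully into one region before backtracking>",
    "self_reflection": "<exemplar: question an earlier crop choice and revise it>",
    "goal_maintenance": "<exemplar: restate the target before each new crop>",
}

def build_prompt(question: str, pattern: str) -> str:
    """Prepend one manually crafted exemplar to steer the model's style."""
    return f"{EXEMPLARS[pattern]}\n\nQuestion: {question}\nTrajectory:"

def collect_cold_start(
    questions: List[Dict[str, str]],            # {"question": ..., "answer": ...}
    generate: Callable[[str], Dict[str, str]],  # VLM call -> {"trajectory", "answer"}
    per_question: int = 2,
) -> List[Dict[str, str]]:
    """Sample a reasoning pattern per rollout; keep trajectories whose
    final answer matches the ground truth."""
    kept = []
    for qa in questions:
        for _ in range(per_question):
            pattern = random.choice(list(EXEMPLARS))
            out = generate(build_prompt(qa["question"], pattern))
            if out["answer"] == qa["answer"]:  # filter: correct answers only
                kept.append({"pattern": pattern, **out})
    return kept
```

Repeating this loop over the training questions, with exemplars rotated across patterns, would yield a trajectory pool on the order of the ~6,000 examples reported for SFT initialization.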
[43] Learning to search effective example sequences for in-context learning PDF
[44] Beyond examples: High-level automated reasoning paradigm in in-context learning via mcts PDF
[45] Towards reasoning era: A survey of long chain-of-thought for reasoning large language models PDF
[46] Teaching Algorithmic Reasoning via In-context Learning PDF
[47] Finding Support Examples for In-Context Learning PDF
[48] Making language models better reasoners with step-aware verifier PDF
[49] Reasoning with large language models, a survey PDF
[50] Igniting language intelligence: The hitchhiker's guide from chain-of-thought reasoning to language agents PDF
[51] Reasoning graph enhanced exemplars retrieval for In-Context learning PDF
[52] Personalized Vision via Visual In-Context Learning PDF
Over-turn masking strategy for reinforcement learning
The authors introduce a masking technique that avoids penalizing trajectories exceeding the training-time turn limit by masking their advantages during policy updates. This enables test-time scaling where models trained with only 6 turns can naturally extend to tens of turns at inference while maintaining training efficiency.
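A minimal sketch of the masking idea, assuming GRPO-style group-normalized advantages (the paper's exact estimator may differ): rollouts that exceed the training-time turn limit receive a zero advantage, so they contribute no gradient to the policy update instead of being penalized as failures.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Trajectory:
    reward: float    # terminal reward, e.g. 1.0 if the final answer is correct
    num_turns: int   # interaction turns this rollout actually used

def masked_advantages(group: List[Trajectory], max_turns: int = 6) -> List[float]:
    """Group-normalized advantages with over-turn masking.

    Trajectories exceeding the turn limit get advantage 0.0, so they are
    neither rewarded nor punished; the rest are normalized by the group's
    reward mean and standard deviation.
    """
    rewards = [t.reward for t in group]
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5
    if std == 0.0:
        std = 1.0  # avoid division by zero for uniform-reward groups
    return [
        0.0 if t.num_turns > max_turns else (t.reward - mean) / std
        for t in group
    ]
```

Because over-turn rollouts never receive a negative learning signal, a policy trained with a 6-turn budget is not discouraged from long explorations, which is what allows it to extend to tens of turns at inference time.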