Mini-o3: Scaling Up Reasoning Patterns and Interaction Turns for Visual Search

ICLR 2026 Conference Submission, Anonymous Authors
Keywords: Visual Search; Thinking-with-images; Reinforcement Learning
Abstract:

Recent advances in large multimodal models have leveraged image-based tools with reinforcement learning to tackle visual problems. However, existing open-source approaches often exhibit monotonous reasoning patterns and allow only a limited number of interaction turns, making them inadequate for difficult tasks that require trial-and-error exploration. In this work, we address this limitation by scaling up tool-based interactions and introduce Mini-o3, a system that executes deep, multi-turn reasoning—spanning tens of steps—and achieves state-of-the-art performance on challenging visual search tasks. Our recipe for reproducing OpenAI o3–style behaviors comprises three key components. First, we construct the Visual Probe Dataset, a collection of thousands of challenging visual search problems designed for exploratory reasoning. Second, we develop an iterative data collection pipeline to obtain cold-start trajectories that exhibit diverse reasoning patterns, including depth-first search, trial-and-error, and goal maintenance. Third, we propose an over-turn masking strategy that prevents penalization of over-turn responses (those that hit the maximum number of turns) during reinforcement learning, thereby balancing training-time efficiency with test-time scalability. Despite training with an upper bound of only six interaction turns, our model generates trajectories that naturally scale to tens of turns at inference time, with accuracy improving as the number of turns increases. Extensive experiments demonstrate that Mini-o3 produces rich reasoning patterns and deep thinking paths, effectively solving challenging visual search problems.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces Mini-o3, a system for deep multi-turn visual search using reinforcement learning to execute reasoning trajectories spanning tens of steps. It resides in the 'Deep Multi-turn Visual Search with RL' leaf, which contains four papers including the original work. This leaf sits within the broader 'Reinforcement Learning for Multi-turn Tool-based Reasoning' branch, indicating a moderately populated research direction focused on RL-driven tool orchestration. The taxonomy reveals this is an active but not overcrowded area, with sibling papers like MMSearch-R1 and Deepmmsearch-r1 pursuing similar extended reasoning paradigms.

The taxonomy positions this work at the intersection of RL-based tool-use and multi-turn reasoning, distinct from supervised frameworks (e.g., LLaVA-Plus, Beyond Seeing) that rely on curated demonstrations rather than trial-and-error learning. Neighboring leaves include 'Agentic Tool-use with Visual Reasoning via RL' and 'Multi-step RL for Reasoning and Tool Integration', which explore synthetic data generation and reward shaping. The scope note for the original leaf explicitly excludes single-turn or shallow reasoning systems, clarifying that Mini-o3's contribution lies in scaling interaction depth rather than breadth of tool types or domains.

Among the 23 candidates examined, none clearly refutes the three core contributions. For the Visual Probe Dataset, 10 candidates were examined with zero refutable overlaps, suggesting novelty in constructing challenging visual search problems for exploratory reasoning. For the iterative data collection pipeline, 10 candidates were likewise examined without refutation, indicating that the approach to generating diverse cold-start trajectories (depth-first search, trial-and-error, goal maintenance) appears distinct within the limited search scope. For the over-turn masking strategy, only 3 candidates were examined, reflecting a more specialized technical contribution with no identified prior work in the sample.

Based on the limited search of 23 semantically similar papers, the work appears to introduce novel components within its specific niche of deep multi-turn visual search. The taxonomy context shows a moderately active research area with clear boundaries separating RL-based from supervised approaches. However, the analysis does not cover exhaustive literature review or adjacent fields outside the top-K semantic matches, leaving open the possibility of related work in broader RL or visual reasoning domains not captured here.

Taxonomy

Core-task Taxonomy Papers: 32
Claimed Contributions: 3
Contribution Candidate Papers Compared: 23
Refutable Papers: 0

Research Landscape Overview

Core task: Multi-turn visual search with image-based tool interactions. The field encompasses diverse approaches to enabling agents to iteratively query, retrieve, and reason over visual information using external tools. At the highest level, the taxonomy distinguishes reinforcement learning methods that train agents to select and orchestrate tools through trial-and-error (e.g., Mini-o3[0], MMSearch-R1[1], Deepmmsearch-r1[16]) from supervised and instruction-tuned frameworks that rely on curated demonstrations or fine-tuning (e.g., LLaVA-Plus[6], Beyond Seeing[7]).

Parallel branches address retrieval-augmented generation for multimodal tasks (M2RAG[20], MARAG-R1[23]), benchmarking and evaluation suites that measure tool-use capabilities (PhotoScout[17], Knowledge-based Visual QA[18]), interactive user interfaces for visual search (VISAtlas[10], ChatEdit[13]), and foundational surveys that map the broader landscape of agentic systems (AI Agent Systems[21]). Additional specialized branches cover code-based repository search agents and legacy interactive systems (Visual Information Seeking[19], JIGSAW[32]), reflecting the evolution from early graphical interfaces to modern neural tool-use paradigms.

Within the reinforcement learning branch, a particularly active line of work focuses on deep multi-turn visual search with RL, where agents learn to chain multiple tool calls—such as image retrieval, segmentation, or captioning—over extended reasoning trajectories. Mini-o3[0] exemplifies this direction by training policies that adaptively select tools and refine queries across turns, closely aligning with MMSearch-R1[1] and Deepmmsearch-r1[16], which similarly emphasize iterative search and tool orchestration.
In contrast, works like Visual Agentic Reinforcement[3] and Synthetic Data Multi-Step RL[4] explore synthetic data generation and reward shaping to improve sample efficiency, while Patho-agenticrag[5] applies agentic retrieval-augmented reasoning to domain-specific pathology tasks. The central tension across these efforts lies in balancing exploration complexity—how many turns and tools to consider—with the reliability and interpretability of learned policies, a challenge that Mini-o3[0] addresses through structured reward signals and multi-turn rollout strategies.

Claimed Contributions

Visual Probe Dataset for challenging visual search

The authors introduce a new dataset containing 4,000 training and 500 test visual question-answer pairs across three difficulty levels. The dataset features small targets, numerous distractor objects, and high-resolution images, demanding iterative exploration and trial-and-error reasoning.

10 retrieved papers
Iterative data collection pipeline for diverse cold-start trajectories

The authors propose a pipeline that uses in-context learning with manually crafted exemplars to generate approximately 6,000 multi-turn trajectories. These trajectories demonstrate varied reasoning strategies, such as depth-first search, self-reflection, and goal maintenance, and are used to initialize the model via supervised fine-tuning.
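At a high level, such a pipeline can be viewed as exemplar-conditioned rejection sampling: prompt the model with hand-written demonstrations, roll out a multi-turn trajectory, and keep it only if the final answer verifies. The sketch below is a hypothetical illustration of that loop, not the authors' implementation; `vlm_generate`, `is_correct`, the prompt format, and the dict keys are all assumed placeholders.

```python
import itertools

def collect_cold_start(problems, exemplars, vlm_generate, is_correct, target=6000):
    """Exemplar-conditioned rejection sampling for cold-start data (sketch).

    problems: list of {"question": ..., "answer": ...} dicts.
    exemplars: manually crafted multi-turn demonstrations (strings).
    vlm_generate / is_correct: placeholder hooks for the model rollout
    and answer verification; both are assumptions for illustration.
    """
    trajectories = []
    for problem in itertools.cycle(problems):
        if len(trajectories) >= target:
            break
        # In-context prompt: prepend the hand-written exemplars.
        prompt = "\n\n".join(exemplars) + "\n\n" + problem["question"]
        traj = vlm_generate(prompt)  # one multi-turn tool-use rollout
        if is_correct(traj, problem["answer"]):
            trajectories.append(traj)  # keep only verified trajectories
    return trajectories
```

The filtering step is what makes the collected trajectories safe for supervised fine-tuning: only rollouts that reach a verified answer are kept, regardless of how many exploratory detours they contain.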

10 retrieved papers
Over-turn masking strategy for reinforcement learning

The authors introduce a masking technique that avoids penalizing trajectories exceeding the training-time turn limit by masking their advantages during policy updates. This enables test-time scaling where models trained with only 6 turns can naturally extend to tens of turns at inference while maintaining training efficiency.
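The idea can be sketched as follows. This is a minimal hypothetical implementation, not the paper's code: the PPO-style clipped objective, the function name, and the `over_turn` flag are illustrative assumptions; the key point is only that trajectories which hit the turn cap are excluded from the loss rather than assigned a negative reward.

```python
import math

def masked_policy_loss(logprobs, old_logprobs, advantages, over_turn, clip_eps=0.2):
    """PPO-style clipped loss with over-turn trajectories masked out (sketch).

    logprobs, old_logprobs, advantages: per-trajectory scalars.
    over_turn: True if the rollout hit the turn cap without producing
    a final answer. Masked rollouts contribute neither penalty nor
    gradient, so running out of turns is never treated as a failure.
    """
    kept_losses = []
    for lp, olp, adv, ot in zip(logprobs, old_logprobs, advantages, over_turn):
        if ot:
            continue  # over-turn rollout: drop from the update entirely
        ratio = math.exp(lp - olp)
        unclipped = ratio * adv
        clipped = max(min(ratio, 1 + clip_eps), 1 - clip_eps) * adv
        kept_losses.append(-min(unclipped, clipped))
    if not kept_losses:
        return 0.0
    # Normalize over kept trajectories only.
    return sum(kept_losses) / len(kept_losses)
```

Because the policy is never punished for exceeding the cap, it has no incentive to rush to a premature answer near the turn limit, which is what allows trajectories to stretch to tens of turns at inference time.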

3 retrieved papers

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Visual Probe Dataset for challenging visual search

The authors introduce a new dataset containing 4,000 training and 500 test visual question-answer pairs across three difficulty levels. The dataset features small targets, numerous distractor objects, and high-resolution images, demanding iterative exploration and trial-and-error reasoning.

Contribution

Iterative data collection pipeline for diverse cold-start trajectories

The authors propose a pipeline that uses in-context learning with manually crafted exemplars to generate approximately 6,000 multi-turn trajectories. These trajectories demonstrate varied reasoning strategies, such as depth-first search, self-reflection, and goal maintenance, and are used to initialize the model via supervised fine-tuning.

Contribution

Over-turn masking strategy for reinforcement learning

The authors introduce a masking technique that avoids penalizing trajectories exceeding the training-time turn limit by masking their advantages during policy updates. This enables test-time scaling where models trained with only 6 turns can naturally extend to tens of turns at inference while maintaining training efficiency.