Mini-o3: Scaling Up Reasoning Patterns and Interaction Turns for Visual Search

ICLR 2026 Conference Submission, Anonymous Authors
Keywords: Visual Search; Thinking-with-images; Reinforcement Learning
Abstract:

Recent advances in large multimodal models have leveraged image-based tools with reinforcement learning to tackle visual problems. However, existing open-source approaches often exhibit monotonous reasoning patterns and allow only a limited number of interaction turns, making them inadequate for difficult tasks that require trial-and-error exploration. In this work, we address this limitation by scaling up tool-based interactions and introduce Mini-o3, a system that executes deep, multi-turn reasoning—spanning tens of steps—and achieves state-of-the-art performance on challenging visual search tasks. Our recipe for reproducing OpenAI o3–style behaviors comprises three key components. First, we construct the Visual Probe Dataset, a collection of thousands of challenging visual search problems designed for exploratory reasoning. Second, we develop an iterative data collection pipeline to obtain cold-start trajectories that exhibit diverse reasoning patterns, including depth-first search, trial-and-error, and goal maintenance. Third, we propose an over-turn masking strategy that prevents penalization of over-turn responses (those that hit the maximum number of turns) during reinforcement learning, thereby balancing training-time efficiency with test-time scalability. Despite training with an upper bound of only six interaction turns, our model generates trajectories that naturally scale to tens of turns at inference time, with accuracy improving as the number of turns increases. Extensive experiments demonstrate that Mini-o3 produces rich reasoning patterns and deep thinking paths, effectively solving challenging visual search problems.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces Mini-o3, a system for deep multi-turn visual search using reinforcement learning to execute reasoning trajectories spanning tens of steps. It resides in the 'Deep Multi-turn Visual Search with RL' leaf, which contains four papers including the original work. This leaf sits within the broader 'Reinforcement Learning for Multi-turn Tool-based Reasoning' branch, indicating a moderately populated research direction focused on RL-driven tool orchestration. The taxonomy reveals this is an active but not overcrowded area, with sibling papers like MMSearch-R1 and Deepmmsearch-r1 pursuing similar extended reasoning paradigms.

The taxonomy positions this work at the intersection of RL-based tool-use and multi-turn reasoning, distinct from supervised frameworks (e.g., LLaVA-Plus, Beyond Seeing) that rely on curated demonstrations rather than trial-and-error learning. Neighboring leaves include 'Agentic Tool-use with Visual Reasoning via RL' and 'Multi-step RL for Reasoning and Tool Integration', which explore synthetic data generation and reward shaping. The scope note for the original leaf explicitly excludes single-turn or shallow reasoning systems, clarifying that Mini-o3's contribution lies in scaling interaction depth rather than breadth of tool types or domains.

Among the 23 candidates examined, none clearly refutes the three core contributions. For the Visual Probe Dataset, 10 candidates were examined with zero refutable overlaps, suggesting novelty in constructing challenging visual search problems for exploratory reasoning. For the iterative data collection pipeline, 10 candidates were likewise examined without refutation, indicating that the approach to generating diverse cold-start trajectories (depth-first search, trial-and-error, goal maintenance) appears distinct within the limited search scope. For the over-turn masking strategy, only 3 candidates were examined, reflecting a more specialized technical contribution with no identified prior work in the sample.

Based on the limited search of 23 semantically similar papers, the work appears to introduce novel components within its specific niche of deep multi-turn visual search. The taxonomy context shows a moderately active research area with clear boundaries separating RL-based from supervised approaches. However, the analysis does not cover exhaustive literature review or adjacent fields outside the top-K semantic matches, leaving open the possibility of related work in broader RL or visual reasoning domains not captured here.

Taxonomy

Core-task Taxonomy Papers: 32
Claimed Contributions: 3
Contribution Candidate Papers Compared: 23
Refutable Papers: 0

Research Landscape Overview

Core task: Multi-turn visual search with image-based tool interactions. The field encompasses diverse approaches to enabling agents to iteratively query, retrieve, and reason over visual information using external tools. At the highest level, the taxonomy distinguishes reinforcement learning methods that train agents to select and orchestrate tools through trial-and-error (e.g., Mini-o3[0], MMSearch-R1[1], Deepmmsearch-r1[16]) from supervised and instruction-tuned frameworks that rely on curated demonstrations or fine-tuning (e.g., LLaVA-Plus[6], Beyond Seeing[7]).

Parallel branches address retrieval-augmented generation for multimodal tasks (M2RAG[20], MARAG-R1[23]), benchmarking and evaluation suites that measure tool-use capabilities (PhotoScout[17], Knowledge-based Visual QA[18]), interactive user interfaces for visual search (VISAtlas[10], ChatEdit[13]), and foundational surveys that map the broader landscape of agentic systems (AI Agent Systems[21]). Additional specialized branches cover code-based repository search agents and legacy interactive systems (Visual Information Seeking[19], JIGSAW[32]), reflecting the evolution from early graphical interfaces to modern neural tool-use paradigms.

Within the reinforcement learning branch, a particularly active line of work focuses on deep multi-turn visual search with RL, where agents learn to chain multiple tool calls—such as image retrieval, segmentation, or captioning—over extended reasoning trajectories. Mini-o3[0] exemplifies this direction by training policies that adaptively select tools and refine queries across turns, closely aligning with MMSearch-R1[1] and Deepmmsearch-r1[16], which similarly emphasize iterative search and tool orchestration.
In contrast, works like Visual Agentic Reinforcement[3] and Synthetic Data Multi-Step RL[4] explore synthetic data generation and reward shaping to improve sample efficiency, while Patho-agenticrag[5] applies agentic retrieval-augmented reasoning to domain-specific pathology tasks. The central tension across these efforts lies in balancing exploration complexity—how many turns and tools to consider—with the reliability and interpretability of learned policies, a challenge that Mini-o3[0] addresses through structured reward signals and multi-turn rollout strategies.

Claimed Contributions

Visual Probe Dataset for challenging visual search

The authors introduce a new dataset containing 4,000 training and 500 test visual question-answer pairs across three difficulty levels. The dataset features small targets, numerous distractor objects, and high-resolution images, demanding iterative exploration and trial-and-error reasoning.

10 retrieved papers
Iterative data collection pipeline for diverse cold-start trajectories

The authors propose a pipeline that uses in-context learning with manually crafted exemplars to generate approximately 6,000 multi-turn trajectories. These trajectories demonstrate varied reasoning strategies, such as depth-first search, self-reflection, and goal maintenance, and are used to initialize the model via supervised fine-tuning.
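At a high level, such a pipeline can be viewed as exemplar-conditioned rejection sampling: prompt the model with hand-written demonstrations, roll out a multi-turn trajectory, and keep it only if the final answer verifies. The sketch below is a hypothetical illustration of that loop, not the authors' implementation; `vlm_generate`, `is_correct`, the prompt format, and the dict keys are all assumed placeholders.

```python
import itertools

def collect_cold_start(problems, exemplars, vlm_generate, is_correct, target=6000):
    """Exemplar-conditioned rejection sampling for cold-start data (sketch).

    problems: list of {"question": ..., "answer": ...} dicts.
    exemplars: manually crafted multi-turn demonstrations (strings).
    vlm_generate / is_correct: placeholder hooks for the model rollout
    and answer verification; both are assumptions for illustration.
    """
    trajectories = []
    for problem in itertools.cycle(problems):
        if len(trajectories) >= target:
            break
        # In-context prompt: prepend the hand-written exemplars.
        prompt = "\n\n".join(exemplars) + "\n\n" + problem["question"]
        traj = vlm_generate(prompt)  # one multi-turn tool-use rollout
        if is_correct(traj, problem["answer"]):
            trajectories.append(traj)  # keep only verified trajectories
    return trajectories
```

The filtering step is what makes the collected trajectories safe for supervised fine-tuning: only rollouts that reach a verified answer are kept, regardless of how many exploratory detours they contain.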

10 retrieved papers
Over-turn masking strategy for reinforcement learning

The authors introduce a masking technique that avoids penalizing trajectories exceeding the training-time turn limit by masking their advantages during policy updates. This enables test-time scaling where models trained with only 6 turns can naturally extend to tens of turns at inference while maintaining training efficiency.
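The idea can be sketched as follows. This is a minimal hypothetical implementation, not the paper's code: the PPO-style clipped objective, the function name, and the `over_turn` flag are illustrative assumptions; the key point is only that trajectories which hit the turn cap are excluded from the loss rather than assigned a negative reward.

```python
import math

def masked_policy_loss(logprobs, old_logprobs, advantages, over_turn, clip_eps=0.2):
    """PPO-style clipped loss with over-turn trajectories masked out (sketch).

    logprobs, old_logprobs, advantages: per-trajectory scalars.
    over_turn: True if the rollout hit the turn cap without producing
    a final answer. Masked rollouts contribute neither penalty nor
    gradient, so running out of turns is never treated as a failure.
    """
    kept_losses = []
    for lp, olp, adv, ot in zip(logprobs, old_logprobs, advantages, over_turn):
        if ot:
            continue  # over-turn rollout: drop from the update entirely
        ratio = math.exp(lp - olp)
        unclipped = ratio * adv
        clipped = max(min(ratio, 1 + clip_eps), 1 - clip_eps) * adv
        kept_losses.append(-min(unclipped, clipped))
    if not kept_losses:
        return 0.0
    # Normalize over kept trajectories only.
    return sum(kept_losses) / len(kept_losses)
```

Because the policy is never punished for exceeding the cap, it has no incentive to rush to a premature answer near the turn limit, which is what allows trajectories to stretch to tens of turns at inference time.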

3 retrieved papers

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Visual Probe Dataset for challenging visual search

The authors introduce a new dataset containing 4,000 training and 500 test visual question-answer pairs across three difficulty levels. The dataset features small targets, numerous distractor objects, and high-resolution images, demanding iterative exploration and trial-and-error reasoning.

Contribution

Iterative data collection pipeline for diverse cold-start trajectories

The authors propose a pipeline that uses in-context learning with manually crafted exemplars to generate approximately 6,000 multi-turn trajectories. These trajectories demonstrate varied reasoning strategies, such as depth-first search, self-reflection, and goal maintenance, and are used to initialize the model via supervised fine-tuning.

Contribution

Over-turn masking strategy for reinforcement learning

The authors introduce a masking technique that avoids penalizing trajectories exceeding the training-time turn limit by masking their advantages during policy updates. This enables test-time scaling where models trained with only 6 turns can naturally extend to tens of turns at inference while maintaining training efficiency.