AdaReasoner: Dynamic Tool Orchestration for Iterative Visual Reasoning

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: MLLMs, Visual Tools, Reinforcement Learning
Abstract:

While augmenting Multimodal Large Language Models (MLLMs) with tools is a promising direction, current approaches face critical limitations. They often rely on single, atomic tools, failing to address the challenges of multi-turn planning, and they do not equip models with the ability to select effective tool combinations for complex tasks. To overcome these limitations, we introduce AdaReasoner, a framework that teaches models to perform dynamic tool orchestration for iterative visual reasoning. Our paradigm is designed to support a broad spectrum of tools, including computationally intensive, expert-model-based services. It features a comprehensive design that includes a new data curation methodology and a tailored Tool GRPO algorithm to optimize multi-turn tool-calling trajectories, which yields state-of-the-art models that achieve substantial gains over their baselines (+38.7% average on 7B) and reach near-perfect accuracy on complex benchmarks like Visual Spatial Planning (97.6%). This performance surpasses leading proprietary systems such as GPT-5 and Claude Sonnet 4, demonstrating that our approach can effectively overcome scale-based limitations by augmenting smaller models with powerful tool-use capabilities. Critically, we find that AdaReasoner develops emergent, self-adaptive behaviors: it learns to autonomously adopt beneficial tools, discard irrelevant ones, and modulate its usage frequency. This ability to curate its own optimal problem-solving strategies represents a significant step toward building more robust, scalable, and reliable reasoning agents.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 2

Research Landscape Overview

Core task: dynamic tool orchestration for iterative visual reasoning. The field addresses how agents can adaptively select and coordinate external tools (ranging from vision modules and code interpreters to domain-specific APIs) to solve complex visual tasks that require multiple reasoning steps. The taxonomy reveals several major branches: Reinforcement Learning-Based Tool Selection and Orchestration explores end-to-end RL methods that learn which tools to invoke and when; Multi-Agent Collaboration and Orchestration examines systems where multiple specialized agents coordinate their capabilities; Supervised and Hybrid Tool Integration focuses on training regimes that combine demonstration data with learned policies; Adaptive Visual Attention and Perception investigates how agents dynamically adjust their perceptual focus; Domain-Specific Tool-Augmented Reasoning targets applications in medicine, agriculture, and other specialized fields; Workflow Automation and Interface Interaction deals with GUI agents and process automation; Hierarchical and Self-Organizing Agent Architectures studies modular designs that decompose tasks; Supporting Infrastructure and Educational Tools provides benchmarks and teaching frameworks; and Specialized Reasoning and Optimization Tasks covers niche problem settings. Representative works such as MMCTAgent[3] and VisualToolAgent[8] illustrate how tool libraries can be integrated into reasoning pipelines, while approaches like Ego-R1[4] and PixelCraft[5] demonstrate diverse strategies for managing iterative perception and action.
A particularly active line of work centers on end-to-end RL for visual tool use, where agents learn orchestration policies directly from task rewards rather than relying solely on supervised demonstrations. AdaReasoner[0] exemplifies this direction by training an RL-based controller that iteratively selects tools to refine visual understanding, closely aligning with OpenThinking[1] and Chain-of-Focus[2], which similarly emphasize learned decision-making over fixed pipelines. In contrast, VTool-R1[17] and VisualToolAgent[8] blend RL with more structured reasoning traces, highlighting a trade-off between flexibility and interpretability. AdaReasoner[0] distinguishes itself by focusing on adaptive iteration (dynamically deciding when to invoke perception modules versus reasoning steps), whereas Chain-of-Focus[2] prioritizes attention mechanisms and OpenThinking[1] explores transparent reasoning chains. These differences reflect broader questions in the field: how much structure should be imposed on tool selection, whether to optimize end-to-end or modularize components, and how to balance sample efficiency with generalization across diverse visual tasks.

Claimed Contributions

AdaReasoner framework for dynamic tool orchestration

The authors propose a comprehensive framework that enables multimodal large language models to dynamically select and combine tools for complex visual reasoning tasks. The framework includes a data curation methodology for multi-turn tool planning and a tailored Tool GRPO algorithm to optimize multi-turn tool-calling trajectories.
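As a rough illustration of what dynamically selecting and combining tools over several turns can look like, the following minimal sketch runs a decide, call, observe loop. Everything in it is hypothetical and not taken from the paper: the `Step` record, the `TOOLS` registry, the toy `crop` and `ocr` tools, and `scripted_policy`, which stands in for the MLLM.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class Step:
    """One model decision: either call a tool or give a final answer."""
    tool: Optional[str] = None    # name of the tool to call, if any
    argument: str = ""            # input passed to the tool
    answer: Optional[str] = None  # set only on the final turn


# Hypothetical tool registry; real tools would wrap expert vision models.
TOOLS: Dict[str, Callable[[str], str]] = {
    "crop": lambda region: f"cropped({region})",
    "ocr": lambda image: f"text_in({image})",
}


def scripted_policy(history: List[str]) -> Step:
    """Stand-in for the MLLM: crop first, then read text, then answer."""
    if not history:
        return Step(tool="crop", argument="top-left")
    if len(history) == 1:
        return Step(tool="ocr", argument=history[-1])
    return Step(answer=f"answer based on {history[-1]}")


def run_episode(policy, max_turns: int = 4):
    """Ask the policy, execute the chosen tool, feed the observation back."""
    history: List[str] = []
    for _ in range(max_turns):
        step = policy(history)
        if step.answer is not None:
            return step.answer, history
        tool = TOOLS.get(step.tool)
        # An unknown tool becomes an error observation rather than a crash.
        observation = tool(step.argument) if tool else f"error: unknown tool {step.tool}"
        history.append(observation)
    return None, history  # turn budget exhausted without an answer
```

Calling `run_episode(scripted_policy)` chains the two toy tools before answering. In a trained system the policy, not a script, would decide which tool to call and when to stop.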

10 retrieved papers
Data curation methodology for multi-turn tool planning

The authors introduce a three-stage data curation process that generates high-quality, human-like reasoning trajectories. This methodology deliberately incorporates reflection and backtracking scenarios, as well as explicit tool failure cases, to teach models robust problem-solving strategies beyond simply following optimal paths.
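To make the idea of trajectories that include reflection, backtracking, and tool failures more concrete, here is a minimal sketch of how such a record might be represented and inspected. The `Turn` schema, its `kind` labels, and the `describe` helper are assumptions for illustration only; the summary above does not specify the paper's data format or what its three stages are.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Turn:
    kind: str             # "tool_call", "reflection", or "answer" (assumed labels)
    content: str
    failed: bool = False  # True when a tool call returned an error


def describe(trajectory: List[Turn]) -> dict:
    """Summarise which robustness patterns a trajectory exercises."""
    return {
        "tool_calls": sum(t.kind == "tool_call" for t in trajectory),
        "has_reflection": any(t.kind == "reflection" for t in trajectory),
        "has_tool_failure": any(t.failed for t in trajectory),
        "ends_with_answer": bool(trajectory) and trajectory[-1].kind == "answer",
    }


# A non-optimal path: a failed call, a reflection that backtracks, then recovery.
example = [
    Turn("tool_call", "detect(object='door')", failed=True),
    Turn("reflection", "Detection failed; fall back to cropping the raw image."),
    Turn("tool_call", "crop(region='left half')"),
    Turn("answer", "The door is on the left."),
]
```

A helper like `describe(example)` could be used to check that a curated dataset contains enough trajectories with failures and reflections, and not only clean optimal paths.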

10 retrieved papers
Tool GRPO algorithm for multi-turn tool interaction

The authors develop an adaptive reinforcement learning paradigm that extends the GRPO framework to handle multi-turn tool-calling scenarios. This includes multi-turn reward accumulation and an adaptive reward mechanism with asymmetric incentive structure to guide models in learning when and how to use tools effectively.
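The two ingredients named here can be sketched as follows, under stated assumptions. The group-relative advantage (each reward standardised within its sampled group) follows the general GRPO recipe. The reward shape is purely illustrative: the `bonus` and `penalty` constants, the per-turn rewards, and the particular asymmetry are hypothetical, since the summary above does not give the paper's actual formulas.

```python
from statistics import mean, pstdev
from typing import List


def trajectory_reward(turn_rewards: List[float], correct: bool, used_tools: bool,
                      bonus: float = 0.5, penalty: float = 0.1) -> float:
    """Accumulate per-turn rewards, then add an outcome term.

    ASSUMPTION: the asymmetry (tool use earns `bonus` on a correct answer but
    costs only the smaller `penalty` on a wrong one) is illustrative; the
    source does not specify the reward shape or constants.
    """
    total = sum(turn_rewards) + (1.0 if correct else 0.0)
    if used_tools:
        total += bonus if correct else -penalty
    return total


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """GRPO-style advantage: standardise each reward within its sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; the choice of estimator is an assumption
    return [(r - mu) / (sigma + eps) for r in rewards]


# One group of four sampled trajectories for the same prompt.
group = [
    trajectory_reward([0.1, 0.1], correct=True, used_tools=True),
    trajectory_reward([], correct=True, used_tools=False),
    trajectory_reward([0.1], correct=False, used_tools=True),
    trajectory_reward([], correct=False, used_tools=False),
]
advantages = group_relative_advantages(group)
```

Because advantages are computed relative to the group, a trajectory is reinforced only to the extent that it beats its siblings for the same prompt, which is why no separate value network is needed in this family of methods.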

10 retrieved papers
Can Refute

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution
