AdaReasoner: Dynamic Tool Orchestration for Iterative Visual Reasoning

ICLR 2026 Conference Submission

Anonymous Authors

OpenReview Score: 6.5

Keywords: MLLMs, Visual Tools, Reinforcement Learning

Abstract:

While augmenting Multimodal Large Language Models (MLLMs) with tools is a promising direction, current approaches face critical limitations. They often rely on single, atomic tools, failing to address the challenges of multi-turn planning, and they do not equip models with the ability to select effective tool combinations for complex tasks. To overcome these limitations, we introduce AdaReasoner, a framework that teaches models to perform dynamic tool orchestration for iterative visual reasoning. Our paradigm is designed to support a broad spectrum of tools, including computationally intensive, expert-model-based services. It features a comprehensive design that includes a new data curation methodology and a tailored Tool GRPO algorithm to optimize multi-turn tool-calling trajectories, which yields state-of-the-art models that achieve substantial gains over their baselines (+38.7% average on 7B) and reach near-perfect accuracy on complex benchmarks like Visual Spatial Planning (97.6%). This performance surpasses leading proprietary systems such as GPT-5 and Claude Sonnet 4, demonstrating that our approach can effectively overcome scale-based limitations by augmenting smaller models with powerful tool-use capabilities. Critically, we find that AdaReasoner develops emergent, self-adaptive behaviors: it learns to autonomously adopt beneficial tools, discard irrelevant ones, and modulate its usage frequency. This ability to curate its own optimal problem-solving strategies represents a significant step toward building more robust, scalable, and reliable reasoning agents.

Disclaimer

This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.

NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.

If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Research Landscape Overview

Core task: dynamic tool orchestration for iterative visual reasoning. The field addresses how agents can adaptively select and coordinate external tools (ranging from vision modules and code interpreters to domain-specific APIs) to solve complex visual tasks that require multiple reasoning steps. The taxonomy reveals several major branches: Reinforcement Learning-Based Tool Selection and Orchestration explores end-to-end RL methods that learn which tools to invoke and when; Multi-Agent Collaboration and Orchestration examines systems where multiple specialized agents coordinate their capabilities; Supervised and Hybrid Tool Integration focuses on training regimes that combine demonstration data with learned policies; Adaptive Visual Attention and Perception investigates how agents dynamically adjust their perceptual focus; Domain-Specific Tool-Augmented Reasoning targets applications in medicine, agriculture, and other specialized fields; Workflow Automation and Interface Interaction deals with GUI agents and process automation; Hierarchical and Self-Organizing Agent Architectures studies modular designs that decompose tasks; Supporting Infrastructure and Educational Tools provides benchmarks and teaching frameworks; and Specialized Reasoning and Optimization Tasks covers niche problem settings. Representative works such as MMCTAgent[3] and VisualToolAgent[8] illustrate how tool libraries can be integrated into reasoning pipelines, while approaches like Ego-R1[4] and PixelCraft[5] demonstrate diverse strategies for managing iterative perception and action.

A particularly active line of work centers on end-to-end RL for visual tool use, where agents learn orchestration policies directly from task rewards rather than relying solely on supervised demonstrations. AdaReasoner[0] exemplifies this direction by training an RL-based controller that iteratively selects tools to refine visual understanding, closely aligning with OpenThinkIMG[1] and Chain-of-Focus[2], which similarly emphasize learned decision-making over fixed pipelines. In contrast, VTool-R1[17] and VisualToolAgent[8] blend RL with more structured reasoning traces, highlighting a trade-off between flexibility and interpretability. AdaReasoner[0] distinguishes itself by focusing on adaptive iteration (dynamically deciding when to invoke perception modules versus reasoning steps), whereas Chain-of-Focus[2] prioritizes attention mechanisms and OpenThinkIMG[1] explores transparent reasoning chains. These differences reflect broader questions in the field: how much structure should be imposed on tool selection, whether to optimize end-to-end or modularize components, and how to balance sample efficiency with generalization across diverse visual tasks.

Claimed Contributions

AdaReasoner framework for dynamic tool orchestration

10 retrieved papers

The authors propose a comprehensive framework that enables multimodal large language models to dynamically select and combine tools for complex visual reasoning tasks. The framework includes a data curation methodology for multi-turn tool planning and a tailored Tool GRPO algorithm to optimize multi-turn tool-calling trajectories.
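To make "dynamically select and combine tools" concrete, the following is a minimal sketch of a multi-turn tool-calling loop. It is not the authors' implementation: the names `run_episode`, `Policy`, and `Tool`, the string-based tool interface, and the `max_turns` budget are all hypothetical choices made for illustration.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical interfaces (assumptions, not taken from the paper):
# a tool maps a text argument to a text observation, and a policy returns
# either (tool_name, argument) or (None, final_answer).
Tool = Callable[[str], str]
Policy = Callable[[str, List[str]], Tuple[Optional[str], str]]


def run_episode(
    question: str,
    policy: Policy,
    tools: Dict[str, Tool],
    max_turns: int = 4,
) -> Tuple[str, List[str]]:
    """Alternate between policy decisions and tool calls until an answer is given."""
    observations: List[str] = []
    for _ in range(max_turns):
        name, payload = policy(question, observations)
        if name is None:
            # The policy chose to stop calling tools and answer.
            return payload, observations
        tool = tools.get(name)
        if tool is None:
            # Surface the failure as an observation so the policy can recover.
            observations.append(f"error: unknown tool {name!r}")
            continue
        observations.append(tool(payload))
    # Turn budget exhausted without a final answer.
    return "", observations
```

Feeding tool errors back as observations, rather than raising, is one possible design that lets the policy decide whether to retry, switch tools, or answer directly.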

Data curation methodology for multi-turn tool planning

10 retrieved papers

The authors introduce a three-stage data curation process that generates high-quality, human-like reasoning trajectories. This methodology deliberately incorporates reflection and backtracking scenarios, as well as explicit tool failure cases, to teach models robust problem-solving strategies beyond simply following optimal paths.
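As an illustration of what a trajectory with an explicit tool failure and a reflection step might look like, the sketch below rewrites an optimal tool-call sequence so that one call first fails, is reflected upon, and is then retried. The report does not describe the three stages in detail, so the `Step` record, its `kind` labels, and the `inject_tool_failure` helper are hypothetical and only show the general idea.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Step:
    kind: str     # hypothetical labels: "tool_call", "tool_error", "reflection"
    content: str


def inject_tool_failure(optimal: List[Step], index: int, error_msg: str) -> List[Step]:
    """Return a copy of `optimal` in which the tool call at `index` fails once,
    the model reflects on the failure, and the same call is then retried."""
    if optimal[index].kind != "tool_call":
        raise ValueError("index must point at a tool_call step")
    failed = Step("tool_error", error_msg)
    reflection = Step("reflection", f"The call failed ({error_msg}); retrying.")
    retry = optimal[index]
    return optimal[: index + 1] + [failed, reflection, retry] + optimal[index + 1:]
```

Training on such augmented sequences alongside the optimal ones is one way a model could be exposed to recovery behavior rather than only to the shortest successful path.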

Tool GRPO algorithm for multi-turn tool interaction

Can Refute

10 retrieved papers

The authors develop an adaptive reinforcement learning paradigm that extends the GRPO framework to handle multi-turn tool-calling scenarios. This includes multi-turn reward accumulation and an adaptive reward mechanism with asymmetric incentive structure to guide models in learning when and how to use tools effectively.
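The sketch below shows one way multi-turn reward accumulation could feed into GRPO's group-relative advantage. The advantage normalization (reward minus group mean, divided by group standard deviation) is the standard GRPO form; everything else is an assumption for illustration. In particular, the `Turn` and `Trajectory` records, the `tool_bonus` and `fail_penalty` values, and the choice to penalize failed calls more heavily than successful calls are rewarded are hypothetical stand-ins for the "asymmetric incentive structure", whose actual definition is not given in this report.

```python
from dataclasses import dataclass
from statistics import mean, pstdev
from typing import List


@dataclass
class Turn:
    called_tool: bool      # whether the model issued a tool call this turn
    call_succeeded: bool   # whether that call parsed and executed


@dataclass
class Trajectory:
    turns: List[Turn]
    answer_correct: bool


def trajectory_reward(
    traj: Trajectory,
    tool_bonus: float = 0.1,     # hypothetical value
    fail_penalty: float = 0.2,   # hypothetical value, larger than the bonus
) -> float:
    """Accumulate per-turn tool rewards and add the final outcome reward."""
    turn_total = 0.0
    for turn in traj.turns:
        if turn.called_tool:
            turn_total += tool_bonus if turn.call_succeeded else -fail_penalty
    outcome = 1.0 if traj.answer_correct else 0.0
    return outcome + turn_total


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """GRPO-style advantage: normalize each reward within its sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]
```

In a full training loop, each advantage would weight the log-probabilities of every token in the corresponding multi-turn trajectory; that policy-gradient step is omitted here.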

Core Task Comparisons

Comparisons with papers in the same taxonomy category

[1] OpenThinkIMG: Learning to think with images via visual tool reinforcement learning

Zhaochen Su, Linjie Li, Mingyang Song, Yunzhuo Hao, Zhengyuan Yang, Jun Zhang, Guanjie Chen, Jiawei Gu, Juntao Li, Xiaoye Qu, Yu Cheng (2025)

[2] Chain-of-Focus: Adaptive Visual Search and Zooming for Multimodal Reasoning via RL

Xintong Zhang, Zhi Gao, Bofei Zhang, Pengxiang Li, Xiaowen Zhang, Yang Liu, Tao Yuan, Yuwei Wu, Yunde Jia, Song-Chun Zhu, Qing Li (2025)

[8] VisualToolAgent (VisTA): A reinforcement learning framework for visual tool selection

Zeyi Huang, Yuyang Ji, Anirudh Sundara Rajan, Zefan Cai, Wen Xiao, Junjie Hu, Yong Jae Lee (2025)

[17] VTool-R1: VLMs Learn to Think with Images via Reinforcement Learning on Multimodal Tool Use

Mingyuan Wu, Jingcheng Yang, Jize Jiang, Meitang Li, Kaizhuo Yan, Hanchao Yu, Minjia Zhang, Chengxiang Zhai, Klara Nahrstedt (2025)

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution: AdaReasoner framework for dynamic tool orchestration

[1] OpenThinkIMG: Learning to think with images via visual tool reinforcement learning (verdict: Cannot Refute)

[2] Chain-of-Focus: Adaptive Visual Search and Zooming for Multimodal Reasoning via RL (verdict: Cannot Refute)

[3] MMCTAgent: Multi-modal critical thinking agent framework for complex visual reasoning (verdict: Cannot Refute)

[5] PixelCraft: A Multi-Agent System for High-Fidelity Visual Reasoning on Structured Images (verdict: Cannot Refute)

[51] ReAgent-V: A Reward-Driven Multi-Agent Framework for Video Understanding (verdict: Cannot Refute)

[52] Deep research agents: A systematic examination and roadmap (verdict: Cannot Refute)

[53] Benchmarking Multimodal Retrieval Augmented Generation with Dynamic VQA Dataset and Self-adaptive Planning Agent (verdict: Cannot Refute)

[54] Beyond seeing: Evaluating multimodal LLMs on tool-enabled image perception, transformation, and reasoning (verdict: Cannot Refute)

[55] Towards robust multi-modal reasoning via model selection (verdict: Cannot Refute)

[56] SILMM: Self-Improving Large Multimodal Models for Compositional Text-to-Image Generation (verdict: Cannot Refute)

Contribution: Data curation methodology for multi-turn tool planning

[57] Collecting metrics for continuous platform monitoring (verdict: Cannot Refute)

[58] Towards Standardization of GenAI-Driven Agentic Architectures for Radio Access Networks (verdict: Cannot Refute)

[59] Domaino1s: Guiding LLM reasoning for explainable answers in high-stakes domains (verdict: Cannot Refute)

[60] A survey of reasoning and agentic systems in time series with large language models (verdict: Cannot Refute)

[61] Generator-assistant stepwise rollback framework for large language model agent (verdict: Cannot Refute)

[62] GenoTEX: An LLM Agent Benchmark for Automated Gene Expression Data Analysis (verdict: Cannot Refute)

[63] Systematic review of metadata-driven data orchestration in modern analytics engineering (verdict: Cannot Refute)

[64] Domain-Oriented Time Series Inference Agents for Reasoning and Automated Analysis (verdict: Cannot Refute)

[65] Flexible and Reproducible RF Calibration using Google Cloud Workflows (verdict: Cannot Refute)

[66] Reflection-Driven Control for Trustworthy Code Agents (verdict: Cannot Refute)

Contribution: Tool GRPO algorithm for multi-turn tool interaction

[69] Reinforcing multi-turn reasoning in LLM agents via turn-level reward design (verdict: Can Refute)

[74] Reinforcing multi-turn reasoning in LLM agents via turn-level credit assignment (verdict: Can Refute)

[49] Tool-Augmented Policy Optimization: Synergizing Reasoning and Adaptive Tool Use with Reinforcement Learning (verdict: Cannot Refute)

[67] ReTool: Reinforcement Learning for Strategic Tool Use in LLMs (verdict: Cannot Refute)

[68] Demystifying reinforcement learning in agentic reasoning (verdict: Cannot Refute)

[70] Reinforcement learning foundations for deep research systems: A survey (verdict: Cannot Refute)

[71] Fathom-DeepResearch: Unlocking long horizon information retrieval and synthesis for SLMs (verdict: Cannot Refute)

[72] Agentic reinforced policy optimization (verdict: Cannot Refute)

[73] StepTool: A step-grained reinforcement learning framework for tool learning in LLMs (verdict: Cannot Refute)

[75] ToRL: Scaling Tool-Integrated RL (verdict: Cannot Refute)
