Understanding Tool-Integrated Reasoning

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: Large Language Models, Tool-Integrated Reasoning, Reinforcement Learning, Advantage Shaping Policy Optimization
Abstract:

We study why Tool-Integrated Reasoning (TIR) makes Large Language Models (LLMs) more capable. While LLMs augmented with tools such as Python code interpreters show great promise, a principled theory explaining why this paradigm is effective has been missing. This work provides the first formal proof that TIR fundamentally expands an LLM's capabilities: we demonstrate that tools enable a strict expansion of the model's empirical and feasible support, breaking the capability ceiling of pure-text models by unlocking problem-solving strategies that are otherwise impossible or intractably verbose. To guide model behavior without compromising training stability or performance, we also introduce Advantage Shaping Policy Optimization (ASPO), a novel algorithm that directly modifies the advantage function to steer policy behavior. We conduct comprehensive experiments on challenging mathematical benchmarks, using a Python interpreter as the external tool. Our results show that the TIR model decisively outperforms its pure-text counterpart on the pass@k metric. Crucially, this advantage is not confined to computationally intensive problems but extends to those requiring significant abstract insight. We further identify emergent cognitive patterns that illustrate how models learn to think with tools. Finally, we observe improved tool-usage behavior under ASPO, with earlier code invocation and substantially more interaction turns. Overall, our work provides the first principled explanation for TIR's success, shifting the focus from the mere fact that tools work to why and how they enable more powerful reasoning.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

This paper contributes a formal proof that tool-integrated reasoning fundamentally expands LLM capabilities by enabling strict support expansion, alongside a novel training algorithm (ASPO) and empirical analysis on mathematical benchmarks. It resides in the 'Formal Theory and Capability Analysis' leaf, which contains only two papers total. This represents a sparse research direction within the broader taxonomy: while the field includes six major branches addressing tool integration from training methods to domain applications, formal theoretical analysis of capability boundaries remains relatively underexplored compared to empirical or application-focused work.

The taxonomy reveals substantial activity in neighboring areas. The parent branch 'Theoretical Foundations and Frameworks' also includes comprehensive surveys (six papers) and context engineering studies (two papers), indicating that while broad reviews exist, rigorous formal theory is less common. Adjacent branches show dense clusters in 'Training and Optimization Methodologies' (especially reinforcement learning with seven papers) and 'Domain-Specific Applications' (particularly mathematical reasoning with six papers). The paper bridges these areas by providing theoretical grounding for why tool integration works, then validating through mathematical problem-solving experiments, connecting formal analysis to practical training concerns.

Among the three contributions analyzed, the formal proof examined ten candidates with zero refutations, and the ASPO algorithm similarly showed no clear prior work among ten candidates examined. The empirical analysis contribution, however, examined seven candidates and found one refutable match, suggesting some overlap with existing experimental studies on tool-augmented mathematical reasoning. This pattern indicates the theoretical contributions appear more novel within the limited search scope (27 total candidates), while the empirical validation aligns with established benchmarking practices in the mathematical reasoning subfield, which already contains multiple tool-integration studies.

Based on examination of 27 semantically similar candidates, the work's theoretical core appears distinctive within the current literature, though the empirical component shows expected overlap with mathematical reasoning benchmarks. The analysis covers top-K semantic matches and does not represent an exhaustive review of all tool-integration research. The sparse population of the formal-theory leaf (two papers) versus the dense mathematical-application cluster (six papers) suggests the field has prioritized practical demonstrations over foundational capability analysis, positioning this work's theoretical contributions in relatively open territory.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 27
Refutable Papers: 1

Research Landscape Overview

Core task: tool-integrated reasoning in large language models.

The field has organized itself around six major branches that collectively address how LLMs can leverage external tools to enhance their reasoning capabilities. Theoretical Foundations and Frameworks (where Tool-Integrated Reasoning[0] resides) examines formal properties and capability boundaries, often analyzing when and why tool use succeeds or fails. Training and Optimization Methodologies explores techniques such as reinforcement learning and self-training to improve tool-calling behavior, while Tool Creation and Management investigates how models can autonomously generate or select appropriate tools. Domain-Specific Applications demonstrates tool integration in areas like mathematics, chemistry, and healthcare; Advanced Reasoning Architectures proposes novel multi-step or agentic designs; and Evaluation and Benchmarking develops metrics and testbeds to measure tool-use proficiency. Together, these branches reflect a progression from understanding core principles to deploying practical systems across diverse problem settings.

Several active lines of work reveal key trade-offs and open questions. Mathematical reasoning has attracted substantial attention, with approaches like ToRA[8] and Qwen Math[22] integrating symbolic computation tools, while domain applications such as Chemistry Tools[3] and EHRAgent[20] tailor tool suites to specialized knowledge. A recurring theme is the tension between model autonomy and reliability: works like Tool Makers[4] and CREATOR[17] enable models to craft their own tools, yet studies such as Tool-Induced Hallucinations[12] highlight risks when models misuse or over-rely on external calls.

Tool-Integrated Reasoning[0] sits within the Theoretical Foundations branch alongside Tool-Induced Hallucinations[12], focusing on formal capability analysis rather than empirical deployment. Compared to Meta-reasoning[5], which examines higher-order reasoning strategies, Tool-Integrated Reasoning[0] emphasizes the foundational question of what tool integration can and cannot achieve, providing a conceptual anchor for understanding the limits and potential of this rapidly evolving paradigm.

Claimed Contributions

Formal proof that TIR expands LLM capabilities via strict support expansion

The authors prove theoretically that integrating external tools (like Python interpreters) strictly expands both the empirical support and feasible support of language models compared to pure-text models. This expansion breaks the capability ceiling by enabling problem-solving strategies that are impossible or intractably verbose for text-only models.

10 retrieved papers
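The strict-support claim above can be rendered compactly. A minimal formal sketch, assuming notation not taken from the paper itself (policies π_text and π_TIR over answer sequences y given a prompt x):

```latex
% Support-expansion sketch (illustrative notation, not the paper's
% exact statement): the set of answer sequences reachable with
% non-negligible probability strictly grows once a tool channel
% is available.
\operatorname{supp}\bigl(\pi_{\text{text}}(\cdot \mid x)\bigr)
\;\subsetneq\;
\operatorname{supp}\bigl(\pi_{\text{TIR}}(\cdot \mid x)\bigr)
```

Strictness here would mean some solution trace, for example one that executes code to enumerate cases, lies in the TIR support but in no pure-text support.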
Advantage Shaping Policy Optimization (ASPO) algorithm

The authors propose ASPO, a new training algorithm that shapes model behavior by directly modifying the advantage function rather than the reward function. This approach overcomes training instability issues in GRPO-like algorithms when trying to guide behaviors such as earlier tool invocation.

10 retrieved papers
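The distinction ASPO draws, shaping the advantage rather than the reward, can be illustrated with a minimal Python sketch. Everything below is an assumption for illustration (the bonus form, the `beta` scale, and the function names are not the paper's formulation): a GRPO-style group-normalized advantage is computed from unchanged rewards, and a bonus favoring earlier tool invocation is added directly to the advantage, leaving the reward and hence the group baseline untouched.

```python
# Sketch of advantage shaping on top of a GRPO-style group baseline.
# All names and the bonus form are illustrative assumptions.
from statistics import mean, pstdev

def grpo_advantages(rewards):
    """GRPO-style baseline: normalize rewards within a sampled group."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + 1e-8) for r in rewards]

def aspo_advantages(rewards, first_call_steps, beta=0.1):
    """Shape the advantage (not the reward): rollouts that invoke the
    tool earlier receive a larger additive bonus."""
    base = grpo_advantages(rewards)
    max_step = max(first_call_steps)
    return [a + beta * (1.0 - s / max_step)
            for a, s in zip(base, first_call_steps)]

# Four rollouts: binary correctness rewards and the step index of the
# first code invocation in each rollout.
rewards = [1.0, 0.0, 1.0, 0.0]
first_call = [5, 40, 20, 10]
advs = aspo_advantages(rewards, first_call)
```

Because the bonus is applied after normalization, it reorders rollouts with equal reward (the correct rollout calling code at step 5 ends up with a higher advantage than the one calling at step 20) without perturbing the reward statistics that GRPO-like methods rely on for stability.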
Comprehensive empirical analysis revealing emergent cognitive patterns in TIR

The authors conduct extensive experiments on mathematical benchmarks that validate their theoretical claims and identify three emergent cognitive patterns showing how models learn to think with tools: insight-to-computation transformation, exploration and verification via code, and offloading complex calculations.

7 retrieved papers
Can Refute

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution
