Abstract:

Training LLM agents for complex multi-turn decision-making tasks requires extensive exploration of their environment, and reinforcement learning (RL) is a natural fit for this setting. However, the open-source community currently lacks a unified RL framework capable of training agents from scratch across diverse, realistic environments. To bridge this gap, we introduce AgentGym-RL, a modular and decoupled framework designed for RL-based agent training on multi-turn decision-making tasks. It offers high flexibility and extensibility, supports mainstream RL algorithms, and spans a broad range of real-world scenarios. To train agents effectively on challenging tasks, we argue that they must scale up external interaction with the environment rather than rely solely on internal reasoning. Yet training agents for long-horizon interaction with vanilla methods is often unstable. To this end, we propose ScalingInter-RL, a staged approach for stable long-horizon RL training: it starts with short-horizon interaction to establish foundational policies and progressively lengthens the horizon to encourage deeper exploration. Extensive experiments show that agents trained with our method match or even surpass commercial counterparts such as OpenAI o3 and Gemini-2.5-Pro across 27 tasks in diverse environments. We share key insights and will release the full framework, including code and datasets, to empower the community in building the next generation of intelligent agents.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces AgentGym-RL, a modular framework for training LLM agents via multi-turn reinforcement learning, and ScalingInter-RL, a staged training approach that progressively expands interaction horizons. It resides in the 'General-Purpose Multi-Turn RL Frameworks' leaf, which contains four papers including the original work. This leaf sits within the broader 'Core Multi-Turn RL Training Frameworks and Algorithms' branch, indicating a moderately populated research direction focused on foundational infrastructure rather than domain-specific applications. The taxonomy shows this is an active but not overcrowded area, with sibling papers exploring similar unified training infrastructures.

The taxonomy reveals neighboring leaves addressing hierarchical architectures, tree search methods, and policy optimization algorithms—all within the same parent branch. These directions share the goal of enabling stable multi-turn learning but diverge in their mechanisms: hierarchical methods decompose planning from execution, tree search approaches integrate explicit lookahead, and policy optimization focuses on gradient stability. The framework's position suggests it aims for generality across environments, contrasting with the 'Domain-Specific Multi-Turn Agent Applications' branch that tailors methods to web navigation, code generation, or multi-modal tasks. The taxonomy's scope and exclude notes clarify that AgentGym-RL's modularity distinguishes it from domain-restricted or search-intensive alternatives.

Of the thirty candidates examined (ten per contribution), the AgentGym-RL framework contribution shows no clear refutation among its ten candidates, suggesting limited direct overlap in the sampled literature. The ScalingInter-RL staged training approach, however, encountered three refutable candidates among its ten, indicating that progressive horizon expansion or curriculum-based training has prior instantiations within the limited search scope. The third contribution—demonstrating that scaling interactions outperforms scaling model size—found no refutations among its ten candidates, though this may reflect the specific framing rather than exhaustive coverage. Overall, while the framework appears relatively novel within the examined set, the staged training concept has more substantial prior work.

Given the limited search scope of thirty semantically similar candidates, this assessment captures local novelty rather than field-wide originality. The framework's modularity and the interaction-scaling insight appear less contested in the sampled literature, while the staged training approach overlaps with existing curriculum or progressive methods. The taxonomy context suggests the work occupies a moderately active niche, contributing infrastructure that complements rather than displaces existing hierarchical or search-based approaches. A broader literature review would be needed to assess whether the framework's specific design choices or the interaction-scaling claim represent substantive advances beyond the examined candidates.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 3

Research Landscape Overview

Core task: training LLM agents for long-horizon decision making via multi-turn reinforcement learning. The field has evolved into a rich landscape organized around several complementary directions. At its heart lie core multi-turn RL training frameworks and algorithms, which provide general-purpose infrastructures for iterative policy improvement across extended interaction sequences—works such as AgentGym-RL[0], SkyRL-Agent[9], and RAGEN[15] exemplify this foundational branch. Alongside these frameworks, specialized training techniques and optimizations refine credit assignment, reward shaping, and policy modulation to handle the unique challenges of long horizons. Domain-specific multi-turn agent applications translate these methods into concrete settings like web navigation, software engineering, and interactive planning, while memory and knowledge augmentation branches address the need to maintain and retrieve context over many turns. Theoretical foundations and survey studies offer broader perspectives on the emerging paradigm, and reasoning efficiency with test-time optimization explores how agents can plan and search more effectively during deployment. Cross-domain and auxiliary applications round out the taxonomy by connecting multi-turn RL to related problems beyond the core task.

Within this landscape, a particularly active line of work focuses on general-purpose multi-turn RL frameworks that balance scalability with sample efficiency. AgentGym-RL[0] sits squarely in this cluster, emphasizing a unified training infrastructure that supports diverse environments and iterative policy updates. Nearby efforts like AgentGym-RL Multi-Turn[1] and SkyRL-Agent[9] share a similar emphasis on flexible, environment-agnostic training loops, though they may differ in how they handle trajectory rollout or reward aggregation across turns.
In contrast, RAGEN[15] and Tree Search Agent RL[3] incorporate more explicit search or planning mechanisms during training, trading off some generality for tighter integration of lookahead reasoning. A central open question across these branches is how to efficiently propagate credit over dozens of turns without overwhelming computational budgets or destabilizing learning dynamics. AgentGym-RL[0] addresses this by providing modular components for multi-turn rollout and policy optimization, positioning itself as a practical toolkit that complements more search-intensive approaches like Tree Search Agent RL[3] while remaining accessible to a broad range of downstream applications.

Claimed Contributions

AgentGym-RL framework for multi-turn RL-based agent training

A unified, open-source reinforcement learning framework with modular architecture that supports mainstream RL algorithms and spans diverse real-world scenarios including web navigation, deep search, digital games, embodied tasks, and scientific tasks. The framework enables training LLM agents from scratch across heterogeneous environments with high flexibility and extensibility.
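The description above implies a trainer decoupled from any particular environment: each scenario only has to expose a common interaction interface. The sketch below illustrates one way such an environment-agnostic multi-turn rollout loop could look; the `Env` interface, the toy `GuessParityEnv`, and all other names here are hypothetical illustrations, not the actual AgentGym-RL API.

```python
import random
from dataclasses import dataclass, field

class Env:
    """Minimal interface the trainer depends on (assumed, for illustration).

    Any multi-turn environment (web, search, game, embodied, science)
    only needs reset()/step(), keeping the training loop environment-agnostic.
    """
    def reset(self):
        raise NotImplementedError
    def step(self, action):
        raise NotImplementedError  # -> (next_obs, reward, done)

class GuessParityEnv(Env):
    """Toy stand-in environment: reward 1.0 for guessing a number's parity."""
    def reset(self):
        self.target = random.randint(0, 9)
        return f"number={self.target}"
    def step(self, action):
        correct = "even" if self.target % 2 == 0 else "odd"
        return "terminal", (1.0 if action == correct else 0.0), True

@dataclass
class Trajectory:
    steps: list = field(default_factory=list)  # (obs, action, reward) tuples

def rollout(env, policy, max_turns):
    """Collect one multi-turn trajectory, capped at max_turns interactions."""
    traj, obs = Trajectory(), env.reset()
    for _ in range(max_turns):
        action = policy(obs)
        next_obs, reward, done = env.step(action)
        traj.steps.append((obs, action, reward))
        obs = next_obs
        if done:
            break
    return traj
```

In a full framework, the collected trajectories would feed a policy-optimization step (e.g., PPO or GRPO); the point of the sketch is only the decoupling: the rollout logic never inspects which concrete environment it is driving.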

10 retrieved papers
ScalingInter-RL staged training approach

A progressive interaction-scaling method that starts with short-horizon interactions to establish foundational policies and gradually lengthens the horizon to encourage deeper exploration. This approach mitigates training instability in long-horizon RL by balancing exploitation and exploration through a monotonic schedule that raises the maximum number of interaction turns across training phases.
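The monotonic horizon schedule described above can be sketched as a simple stage-based function. The parameter names (`h_min`, `h_max`, `num_stages`) and the linear interpolation over stages are assumptions for illustration, not the paper's actual schedule.

```python
def horizon_schedule(step, total_steps, h_min=2, h_max=10, num_stages=4):
    """Monotonic interaction-horizon schedule (hypothetical parameterization).

    Training is split into num_stages equal-length phases; the cap on
    interaction turns rises from h_min to h_max as phases advance, so the
    agent first learns on short rollouts before exploring longer horizons.
    """
    stage = min(num_stages - 1, step * num_stages // total_steps)
    # Interpolate the turn cap linearly over stages, rounded to whole turns.
    return h_min + round(stage * (h_max - h_min) / (num_stages - 1))
```

With the defaults and 100 training steps, the cap steps through 2, 5, 7, and 10 turns across the four phases; a rollout loop would pass the current cap as its `max_turns` argument.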

10 retrieved papers
Can Refute
Demonstration that scaling interactions outperforms scaling model size

The work establishes through experiments that increasing post-training and test-time interactions with the environment provides better performance gains than simply increasing model parameters. A 7B parameter model trained with their method achieves results on par with or surpassing much larger commercial models like OpenAI o3 and Gemini-2.5-Pro across 27 tasks.

10 retrieved papers

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution
