In-The-Flow Agentic System Optimization for Effective Planning and Tool Use

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: Reinforcement Learning, Large Language Models, Agentic Systems, Tool Use, Planning, On-policy Optimization, Sparse Rewards
Abstract:

Outcome-driven reinforcement learning has advanced reasoning in large language models (LLMs), but prevailing tool-augmented approaches train a single, monolithic policy that interleaves thoughts and tool calls under full context; this scales poorly with long horizons and diverse tools and generalizes weakly to new scenarios. Agentic systems offer a promising alternative by decomposing work across specialized modules, yet most remain training-free or rely on offline training decoupled from the live dynamics of multi-turn interaction. We introduce AgentFlow, a trainable, in-the-flow agentic framework that coordinates four modules (planner, executor, verifier, generator) through an evolving memory and directly optimizes its planner inside the multi-turn loop. To train on-policy in live environments, we propose Flow-based Group Refined Policy Optimization (Flow-GRPO), which tackles long-horizon, sparse-reward credit assignment by converting multi-turn optimization into a sequence of tractable single-turn policy updates. It broadcasts a single, verifiable trajectory-level outcome to every turn to align local planner decisions with global success and stabilizes learning with group-normalized advantages. Across ten benchmarks, AgentFlow with a 7B-scale backbone outperforms top-performing baselines with average accuracy gains of 14.9% on search, 14.0% on agentic, 14.5% on mathematical, and 4.1% on scientific tasks, even surpassing larger proprietary models like GPT-4o. Further analyses confirm the benefits of in-the-flow optimization, showing improved planning, enhanced tool-calling reliability, and positive scaling with model size and reasoning turns. The codebase is available at https://anonymous.4open.science/r/agentflow.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholarly search engine). It analyzes a paper's tasks and contributions against retrieved prior work. While the system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs), and the system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

AgentFlow introduces a trainable agentic framework that coordinates four specialized modules (planner, executor, verifier, generator) through evolving memory and optimizes the planner within multi-turn interaction loops. The paper resides in the 'Policy Optimization for Multi-Turn Agentic Reasoning' leaf, which contains five papers total, indicating a moderately active but not overcrowded research direction. This leaf focuses specifically on RL algorithms that optimize agent policies across extended interaction horizons with tools and environments, distinguishing it from simpler single-turn or outcome-only RL approaches.

The taxonomy tree reveals that AgentFlow's leaf sits within the broader 'Reinforcement Learning for Agentic Systems' branch, which also includes sibling leaves on search/retrieval agents and preference-based optimization. Neighboring branches address complementary concerns: 'Agentic Framework Architectures' explores modular designs and tool integration patterns, while 'Specialized Domains' examines vertical applications. The scope note for the paper's leaf explicitly excludes single-turn methods, positioning AgentFlow's multi-turn credit assignment focus as a defining characteristic that separates it from adjacent work on static prompt-based systems or non-RL training paradigms.

Among 26 candidates examined across three contributions, no clearly refuting prior work was identified. The AgentFlow framework contribution examined six candidates with zero refutations, Flow-GRPO examined ten candidates with zero refutations, and the evaluation contribution examined ten candidates with zero refutations. This suggests that within the limited search scope—focused on top-K semantic matches and citation expansion—the specific combination of trainable in-the-flow coordination, group-refined policy optimization for multi-turn credit assignment, and trajectory-level outcome broadcasting appears relatively unexplored. However, the sibling papers in the same taxonomy leaf (four others) likely address overlapping themes in multi-turn policy optimization.

The analysis reflects a targeted literature search rather than exhaustive coverage, examining 26 candidates from a 50-paper taxonomy spanning 26 leaf nodes. While no direct refutations emerged within this scope, the presence of four sibling papers in the same leaf indicates that multi-turn agentic policy optimization is an established research direction. The novelty assessment is therefore constrained by the search methodology and would benefit from deeper examination of the sibling papers' specific technical approaches to credit assignment and modular coordination.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 26
Refutable Papers: 0

Research Landscape Overview

Core task: trainable agentic system optimization for planning and tool use. The field organizes around several complementary perspectives. Reinforcement Learning for Agentic Systems explores policy optimization methods that enable agents to learn multi-turn reasoning and tool invocation strategies through trial and error, often leveraging process rewards or outcome-based signals. Agentic Framework Architectures and Design Patterns examines structural blueprints, such as reflection loops, hierarchical decomposition, and modular tool interfaces, that shape how agents coordinate planning with execution. Specialized Domains and Retrieval-Augmented approaches address vertical applications (e.g., urban logistics, manufacturing, medical imaging) and knowledge grounding techniques that enhance factual accuracy. Meanwhile, Evaluation, Benchmarking, and Error Analysis provides the empirical infrastructure to measure agent performance and diagnose failure modes, while Observability and Conceptual Foundations round out the taxonomy with operational tooling and theoretical surveys that contextualize the broader landscape.

Within the reinforcement learning branch, a particularly active line of work focuses on policy optimization for multi-turn agentic reasoning, where agents must iteratively refine plans and select tools across extended interactions. InTheFlow Agentic [0] sits squarely in this cluster, emphasizing continuous policy improvement through feedback loops that adapt both planning heuristics and tool-use decisions. Nearby efforts such as Agentic Reinforced Policy [3] and Verltool [19] similarly explore how RL signals can be injected into agent workflows, though they differ in whether they prioritize end-to-end differentiable architectures or modular reward shaping.
A recurring theme across these studies is the trade-off between sample efficiency and generalization: some methods rely on dense environment interactions to learn robust policies, while others incorporate pre-trained priors or human demonstrations to bootstrap learning. Open questions remain around credit assignment in long-horizon tasks and the scalability of RL-based optimization when tool libraries grow large or when domain shifts occur.

Claimed Contributions

AGENTFLOW: trainable in-the-flow agentic framework

AGENTFLOW is a trainable agentic system that coordinates four specialized modules (planner, executor, verifier, generator) via an evolving memory and directly optimizes the planner policy on-policy within the multi-turn interaction loop, enabling adaptive long-horizon planning and robust tool orchestration.

6 retrieved papers
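The coordination pattern claimed above can be made concrete with a minimal sketch. The module interfaces and control flow below are illustrative assumptions drawn only from the report's description (planner, executor, verifier, generator sharing an evolving memory), not the paper's actual API:

```python
# Hypothetical sketch of an AgentFlow-style module loop.
# All interfaces here are assumptions for illustration; the paper's
# real modules are LLM-backed and the planner is the trained policy.

def run_agentflow(query, planner, executor, verifier, generator, max_turns=5):
    memory = []  # evolving memory shared across turns
    for _ in range(max_turns):
        plan = planner(query, memory)              # pick sub-goal and tool
        result = executor(plan)                    # invoke the chosen tool
        memory.append({"plan": plan, "result": result})
        if verifier(query, memory):                # verifier judges sufficiency
            break
    return generator(query, memory)                # synthesize the final answer
```

Only the planner is optimized in the flow; the other modules stay fixed, which is what makes on-policy training of the planner inside this loop tractable.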
Flow-GRPO: on-policy algorithm for multi-turn optimization

Flow-GRPO is an on-policy reinforcement learning algorithm that addresses long-horizon credit assignment by broadcasting a single verifiable trajectory-level outcome reward to every turn, transforming multi-turn RL into tractable single-turn policy updates with group-normalized advantages for stable training.

10 retrieved papers
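The credit-assignment scheme described above, broadcasting one trajectory-level outcome reward to every turn and normalizing within a rollout group, can be sketched as follows. The function name, shapes, and epsilon are illustrative assumptions, not the paper's implementation:

```python
# Minimal sketch of a Flow-GRPO-style advantage computation:
# each trajectory in a group gets one verifiable outcome reward,
# advantages are normalized across the group, and the same advantage
# is broadcast to every turn of its trajectory.

from statistics import mean, pstdev

def flow_grpo_advantages(group_rewards, turns_per_traj, eps=1e-6):
    """group_rewards: one outcome reward (e.g. 0/1) per trajectory.
    turns_per_traj: number of planner turns in each trajectory.
    Returns a per-turn advantage list for each trajectory."""
    mu = mean(group_rewards)
    sigma = pstdev(group_rewards)
    advantages = []
    for r, n_turns in zip(group_rewards, turns_per_traj):
        a = (r - mu) / (sigma + eps)      # group-normalized advantage
        advantages.append([a] * n_turns)  # broadcast to every turn
    return advantages
```

Because every turn shares its trajectory's advantage, each turn can then be updated as an independent single-turn policy-gradient step, which is the reduction the contribution claims.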
Comprehensive evaluation demonstrating performance gains

The authors demonstrate through experiments on ten diverse reasoning benchmarks that AGENTFLOW with a 7B-scale backbone achieves substantial performance improvements over specialized baselines and larger proprietary models, with analyses revealing improved planning, enhanced tool-calling reliability, and positive scaling properties.

10 retrieved papers

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution
