Hierarchy-of-Groups Policy Optimization for Long-Horizon Agentic Tasks

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: Reinforcement learning, Policy optimization, Long-horizon agent, Hierarchical group
Abstract:

Group-based reinforcement learning (RL), such as GRPO, has advanced the capabilities of large language models on long-horizon agentic tasks. To enable more fine-grained policy updates, recent research has increasingly shifted toward stepwise group-based policy optimization, which treats each step in a rollout trajectory independently while using a memory module to retain historical context. However, we find a key issue in estimating stepwise relative advantages, namely context inconsistency, where steps within the same group may differ in their historical contexts. Empirically, we show that this issue can lead to severely biased advantage estimation, thereby significantly degrading policy optimization. To address this issue, in this paper, we propose Hierarchy-of-Groups Policy Optimization (HGPO) for long-horizon agentic tasks. Specifically, within a group of rollout trajectories, HGPO assigns each step to multiple hierarchical groups according to the consistency of historical contexts. Then, for each step, HGPO computes distinct advantages within each group and aggregates them with an adaptive weighting scheme. In this way, HGPO can achieve a favorable bias-variance trade-off in stepwise advantage estimation, without extra models or rollouts. Evaluations on two challenging agentic tasks, ALFWorld and WebShop, with Qwen2.5-1.5B-Instruct and Qwen2.5-7B-Instruct show that HGPO significantly outperforms existing agentic RL methods under the same computational constraints.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Taxonomy
- Core-task Taxonomy Papers: 33
- Claimed Contributions: 3
- Contribution Candidate Papers Compared: 14
- Refutable Papers: 3
Research Landscape Overview

Core task: stepwise group-based policy optimization for long-horizon agentic tasks. This field addresses the challenge of training agents to solve extended, multi-step problems by organizing trajectories into meaningful groups and assigning credit at intermediate stages.

The taxonomy reveals several complementary perspectives: some branches focus on fine-grained stepwise credit assignment and advantage estimation (e.g., Hierarchy Groups Policy[0], Graph-Enhanced Policy[15]), while others emphasize trajectory-level group optimization (e.g., TGRPO[3], Group in Group[1]) that treats entire rollouts or sub-sequences as units. Multi-turn and interactive agent training branches (e.g., Multi-turn RLHF[2], Multi-Turn Tree Search[7]) capture conversational or iterative settings, whereas embodied and task planning agents (e.g., Embodied Task Planning[10], Multi-step Object Manipulation[18]) deal with physical or simulation environments. Parallel to these, multi-agent coordination branches (e.g., Strategic Coordination Evolving[4], Multi-agent Infrastructure Management[5]) and centralized training with decentralized execution methods (e.g., KOMA[8], Rainbow Fusion MADDPG[25]) explore how multiple agents can learn jointly or independently. Model-based and value-guided approaches (e.g., Sequential World Models[11], Multi-step Plan Value[33]) leverage predictive models or value functions to guide long-horizon planning, while hierarchical and multi-level optimization branches (e.g., Divide Conquer Pathfinding[12], Skill Augmentation Multi-Step[30]) decompose tasks into nested subgoals.

A particularly active theme is the tension between fine-grained stepwise credit and coarser trajectory-level grouping: works like TGRPO[3] and Group Turn Policy[28] aggregate rewards over entire sequences, whereas Hierarchy Groups Policy[0] and Graph-Enhanced Policy[15] introduce hierarchical or context-aware structures to assign advantages at multiple granularities.
The original paper, Hierarchy Groups Policy[0], sits squarely within the stepwise credit assignment branch, emphasizing context-aware hierarchical advantage estimation. Compared to Graph-Enhanced Policy[15], which also refines credit assignment through relational structure, Hierarchy Groups Policy[0] appears to focus more explicitly on nested groupings that reflect task decomposition. This contrasts with trajectory-level methods like TGRPO[3], which optimize over entire rollouts without intermediate hierarchical breakdowns. Open questions remain around balancing the computational overhead of fine-grained credit with the sample efficiency gains it may provide, and how best to integrate hierarchical advantage estimation with multi-agent or model-based planning paradigms.

Claimed Contributions

Revealing context inconsistency in stepwise group-based RL

The authors identify and empirically demonstrate a fundamental problem called context inconsistency that occurs when steps within the same group have different historical contexts during stepwise advantage estimation. They show this issue causes severely biased advantage estimation and degrades policy optimization in long-horizon agentic tasks.
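The bias can be illustrated with a toy example (all reward values are hypothetical, and the mean-centered advantage below is a simplification of the group-relative scheme; GRPO additionally divides by the group standard deviation, but centering alone is enough to show the effect):

```python
def group_advantages(rewards):
    """Group-relative advantage: reward minus the group-mean baseline."""
    mu = sum(rewards) / len(rewards)
    return [r - mu for r in rewards]

# Hypothetical step rewards at the same step index: histories of type A
# tend to yield high reward and histories of type B low reward,
# regardless of the quality of the action taken at this step.
rewards_context_a = [0.9, 0.8]
rewards_context_b = [0.2, 0.1]

# Context-inconsistent group: advantages are dominated by which context
# a step happens to carry, not by its action (roughly +-0.3 to +-0.4).
mixed = group_advantages(rewards_context_a + rewards_context_b)

# Context-consistent groups: each step is compared only against peers
# sharing its history, so the spurious context signal cancels out and
# the remaining advantages are small (roughly +-0.05).
per_context = (group_advantages(rewards_context_a),
               group_advantages(rewards_context_b))
```

In the mixed group, every context-A step looks strongly positive and every context-B step strongly negative purely because of its history, which is the biased signal the paper attributes to context inconsistency.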

1 retrieved paper
Hierarchy-of-Groups Policy Optimization (HGPO) algorithm

The authors introduce HGPO, a novel reinforcement learning algorithm that addresses context inconsistency through two key components: context-aware hierarchical grouping that assigns steps to multiple hierarchical groups based on historical context consistency, and adaptive weighting advantage estimation that aggregates group advantages with weights favoring more consistent contexts.
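Based only on the description above, the grouping-and-aggregation step might be sketched as follows. The grouping criterion (shared context suffix), the weighting rule, and all names are assumptions for illustration; the paper's exact formulas are not reproduced in this report.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class Step:
    context: tuple   # ids of the (observation, action) history before this step
    reward: float

def hierarchical_advantage(steps, max_level=2):
    """Sketch of a hierarchy-of-groups advantage estimate.

    Level 0 puts all steps in one group (large group, but context-
    inconsistent); level k groups steps sharing their last k context
    elements (more consistent, but smaller groups, hence higher
    variance). Per-level advantages are mean-centered rewards,
    aggregated with weights that favor more consistent levels while
    giving singleton groups (no relative signal) zero weight.
    """
    advantages = [0.0] * len(steps)
    weight_total = [0.0] * len(steps)
    for level in range(max_level + 1):
        groups = defaultdict(list)
        for i, s in enumerate(steps):
            key = s.context[-level:] if level else ()
            groups[key].append(i)
        for members in groups.values():
            mu = sum(steps[i].reward for i in members) / len(members)
            w = level + 1 if len(members) > 1 else 0.0
            for i in members:
                advantages[i] += w * (steps[i].reward - mu)
                weight_total[i] += w
    return [a / wt if wt else 0.0 for a, wt in zip(advantages, weight_total)]
```

Note that this reuses the same rollouts at every level, which matches the claim that HGPO needs no extra models or rollouts; only the grouping and weighting change.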

3 retrieved papers
State-of-the-art empirical performance on agentic benchmarks

The authors demonstrate that HGPO achieves superior performance on the ALFWorld and WebShop benchmarks using Qwen2.5 models, consistently outperforming existing methods while matching baselines in GPU memory usage and number of LLM rollouts, with minimal additional time cost.

10 retrieved papers
Can Refute

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Revealing context inconsistency in stepwise group-based RL

The authors identify and empirically demonstrate a fundamental problem called context inconsistency that occurs when steps within the same group have different historical contexts during stepwise advantage estimation. They show this issue causes severely biased advantage estimation and degrades policy optimization in long-horizon agentic tasks.

Contribution

Hierarchy-of-Groups Policy Optimization (HGPO) algorithm

The authors introduce HGPO, a novel reinforcement learning algorithm that addresses context inconsistency through two key components: context-aware hierarchical grouping that assigns steps to multiple hierarchical groups based on historical context consistency, and adaptive weighting advantage estimation that aggregates group advantages with weights favoring more consistent contexts.

Contribution

State-of-the-art empirical performance on agentic benchmarks

The authors demonstrate that HGPO achieves superior performance on the ALFWorld and WebShop benchmarks using Qwen2.5 models, consistently outperforming existing methods while matching baselines in GPU memory usage and number of LLM rollouts, with minimal additional time cost.