Hierarchy-of-Groups Policy Optimization for Long-Horizon Agentic Tasks

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: Reinforcement learning, Policy optimization, Long-horizon agent, Hierarchical group
Abstract:

Group-based reinforcement learning (RL), such as GRPO, has advanced the capabilities of large language models on long-horizon agentic tasks. To enable more fine-grained policy updates, recent research has increasingly shifted toward stepwise group-based policy optimization, which treats each step in a rollout trajectory independently while using a memory module to retain historical context. However, we find a key issue in estimating stepwise relative advantages, namely context inconsistency, where steps within the same group may differ in their historical contexts. Empirically, we show that this issue can lead to severely biased advantage estimation, thereby significantly degrading policy optimization. To address this issue, in this paper, we propose Hierarchy-of-Groups Policy Optimization (HGPO) for long-horizon agentic tasks. Specifically, within a group of rollout trajectories, HGPO assigns each step to multiple hierarchical groups according to the consistency of historical contexts. Then, for each step, HGPO computes distinct advantages within each group and aggregates them with an adaptive weighting scheme. In this way, HGPO can achieve a favorable bias-variance trade-off in stepwise advantage estimation, without extra models or rollouts. Evaluations on two challenging agentic tasks, ALFWorld and WebShop, with Qwen2.5-1.5B-Instruct and Qwen2.5-7B-Instruct show that HGPO significantly outperforms existing agentic RL methods under the same computational constraints.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Taxonomy
- Core-task Taxonomy Papers: 33
- Claimed Contributions: 3
- Contribution Candidate Papers Compared: 14
- Refutable Papers: 3
Research Landscape Overview

Core task: stepwise group-based policy optimization for long-horizon agentic tasks. This field addresses the challenge of training agents to solve extended, multi-step problems by organizing trajectories into meaningful groups and assigning credit at intermediate stages.

The taxonomy reveals several complementary perspectives: some branches focus on fine-grained stepwise credit assignment and advantage estimation (e.g., Hierarchy Groups Policy[0], Graph-Enhanced Policy[15]), while others emphasize trajectory-level group optimization (e.g., TGRPO[3], Group in Group[1]) that treats entire rollouts or sub-sequences as units. Multi-turn and interactive agent training branches (e.g., Multi-turn RLHF[2], Multi-Turn Tree Search[7]) capture conversational or iterative settings, whereas embodied and task planning agents (e.g., Embodied Task Planning[10], Multi-step Object Manipulation[18]) deal with physical or simulation environments. Parallel to these, multi-agent coordination branches (e.g., Strategic Coordination Evolving[4], Multi-agent Infrastructure Management[5]) and centralized training with decentralized execution methods (e.g., KOMA[8], Rainbow Fusion MADDPG[25]) explore how multiple agents can learn jointly or independently. Model-based and value-guided approaches (e.g., Sequential World Models[11], Multi-step Plan Value[33]) leverage predictive models or value functions to guide long-horizon planning, while hierarchical and multi-level optimization branches (e.g., Divide Conquer Pathfinding[12], Skill Augmentation Multi-Step[30]) decompose tasks into nested subgoals.

A particularly active theme is the tension between fine-grained stepwise credit and coarser trajectory-level grouping: works like TGRPO[3] and Group Turn Policy[28] aggregate rewards over entire sequences, whereas Hierarchy Groups Policy[0] and Graph-Enhanced Policy[15] introduce hierarchical or context-aware structures to assign advantages at multiple granularities.
The original paper, Hierarchy Groups Policy[0], sits squarely within the stepwise credit assignment branch, emphasizing context-aware hierarchical advantage estimation. Compared to Graph-Enhanced Policy[15], which also refines credit assignment through relational structure, Hierarchy Groups Policy[0] appears to focus more explicitly on nested groupings that reflect task decomposition. This contrasts with trajectory-level methods like TGRPO[3], which optimize over entire rollouts without intermediate hierarchical breakdowns. Open questions remain around balancing the computational overhead of fine-grained credit with the sample efficiency gains it may provide, and how best to integrate hierarchical advantage estimation with multi-agent or model-based planning paradigms.

Claimed Contributions

Revealing context inconsistency in stepwise group-based RL

The authors identify and empirically demonstrate a fundamental problem called context inconsistency that occurs when steps within the same group have different historical contexts during stepwise advantage estimation. They show this issue causes severely biased advantage estimation and degrades policy optimization in long-horizon agentic tasks.
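The bias can be illustrated with a toy example (all reward values are hypothetical, and the mean-centered advantage below is a simplification of the group-relative scheme; GRPO additionally divides by the group standard deviation, but centering alone is enough to show the effect):

```python
def group_advantages(rewards):
    """Group-relative advantage: reward minus the group-mean baseline."""
    mu = sum(rewards) / len(rewards)
    return [r - mu for r in rewards]

# Hypothetical step rewards at the same step index: histories of type A
# tend to yield high reward and histories of type B low reward,
# regardless of the quality of the action taken at this step.
rewards_context_a = [0.9, 0.8]
rewards_context_b = [0.2, 0.1]

# Context-inconsistent group: advantages are dominated by which context
# a step happens to carry, not by its action (roughly +-0.3 to +-0.4).
mixed = group_advantages(rewards_context_a + rewards_context_b)

# Context-consistent groups: each step is compared only against peers
# sharing its history, so the spurious context signal cancels out and
# the remaining advantages are small (roughly +-0.05).
per_context = (group_advantages(rewards_context_a),
               group_advantages(rewards_context_b))
```

In the mixed group, every context-A step looks strongly positive and every context-B step strongly negative purely because of its history, which is the biased signal the paper attributes to context inconsistency.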

1 retrieved paper
Hierarchy-of-Groups Policy Optimization (HGPO) algorithm

The authors introduce HGPO, a novel reinforcement learning algorithm that addresses context inconsistency through two key components: context-aware hierarchical grouping that assigns steps to multiple hierarchical groups based on historical context consistency, and adaptive weighting advantage estimation that aggregates group advantages with weights favoring more consistent contexts.
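Based only on the description above, the grouping-and-aggregation step might be sketched as follows. The grouping criterion (shared context suffix), the weighting rule, and all names are assumptions for illustration; the paper's exact formulas are not reproduced in this report.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class Step:
    context: tuple   # ids of the (observation, action) history before this step
    reward: float

def hierarchical_advantage(steps, max_level=2):
    """Sketch of a hierarchy-of-groups advantage estimate.

    Level 0 puts all steps in one group (large group, but context-
    inconsistent); level k groups steps sharing their last k context
    elements (more consistent, but smaller groups, hence higher
    variance). Per-level advantages are mean-centered rewards,
    aggregated with weights that favor more consistent levels while
    giving singleton groups (no relative signal) zero weight.
    """
    advantages = [0.0] * len(steps)
    weight_total = [0.0] * len(steps)
    for level in range(max_level + 1):
        groups = defaultdict(list)
        for i, s in enumerate(steps):
            key = s.context[-level:] if level else ()
            groups[key].append(i)
        for members in groups.values():
            mu = sum(steps[i].reward for i in members) / len(members)
            w = level + 1 if len(members) > 1 else 0.0
            for i in members:
                advantages[i] += w * (steps[i].reward - mu)
                weight_total[i] += w
    return [a / wt if wt else 0.0 for a, wt in zip(advantages, weight_total)]
```

Note that this reuses the same rollouts at every level, which matches the claim that HGPO needs no extra models or rollouts; only the grouping and weighting change.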

3 retrieved papers
State-of-the-art empirical performance on agentic benchmarks

The authors demonstrate that HGPO achieves superior performance on the ALFWorld and WebShop benchmarks using Qwen2.5 models, consistently outperforming existing methods while matching baselines in GPU memory usage and number of LLM rollouts, with minimal additional time cost.

10 retrieved papers
Can Refute

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Revealing context inconsistency in stepwise group-based RL

The authors identify and empirically demonstrate a fundamental problem called context inconsistency that occurs when steps within the same group have different historical contexts during stepwise advantage estimation. They show this issue causes severely biased advantage estimation and degrades policy optimization in long-horizon agentic tasks.

Contribution

Hierarchy-of-Groups Policy Optimization (HGPO) algorithm

The authors introduce HGPO, a novel reinforcement learning algorithm that addresses context inconsistency through two key components: context-aware hierarchical grouping that assigns steps to multiple hierarchical groups based on historical context consistency, and adaptive weighting advantage estimation that aggregates group advantages with weights favoring more consistent contexts.

Contribution

State-of-the-art empirical performance on agentic benchmarks

The authors demonstrate that HGPO achieves superior performance on the ALFWorld and WebShop benchmarks using Qwen2.5 models, consistently outperforming existing methods while matching baselines in GPU memory usage and number of LLM rollouts, with minimal additional time cost.