Hierarchy-of-Groups Policy Optimization for Long-Horizon Agentic Tasks
Overview
Taxonomy
Research Landscape Overview
Claimed Contributions
The authors identify and empirically demonstrate a fundamental problem called context inconsistency that occurs when steps within the same group have different historical contexts during stepwise advantage estimation. They show this issue causes severely biased advantage estimation and degrades policy optimization in long-horizon agentic tasks.
The authors introduce HGPO, a novel reinforcement learning algorithm that addresses context inconsistency through two key components: context-aware hierarchical grouping that assigns steps to multiple hierarchical groups based on historical context consistency, and adaptive weighting advantage estimation that aggregates group advantages with weights favoring more consistent contexts.
The authors demonstrate that HGPO outperforms existing methods on the ALFWorld and WebShop benchmarks with Qwen2.5 models, while matching the baselines' GPU memory usage and number of LLM rollouts and incurring only minimal additional time cost.
Core Task Comparisons
Comparisons with papers in the same taxonomy category
[15] Graph-Enhanced Policy Optimization in LLM Agent Training
Contribution Analysis
Detailed comparisons for each claimed contribution
Revealing context inconsistency in stepwise group-based RL
The authors identify and empirically demonstrate a fundamental problem called context inconsistency that occurs when steps within the same group have different historical contexts during stepwise advantage estimation. They show this issue causes severely biased advantage estimation and degrades policy optimization in long-horizon agentic tasks.
[34] Bias Resilient Multi-Step Off-Policy Goal-Conditioned Reinforcement Learning
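To make the claimed problem concrete, the toy example below (illustrative numbers only, not from the paper) shows how pooling steps that were collected under different historical contexts into one group biases group-relative advantage estimates: a step's advantage ends up reflecting how favorable its context was, not how good its action was relative to alternatives in that context.

```python
# Illustrative rewards (hypothetical): steps gathered under two different
# historical contexts have systematically different reward scales.
easy = [0.9, 1.0, 1.1]  # step rewards under an "easy" historical context
hard = [0.1, 0.2, 0.3]  # step rewards under a "hard" historical context

def group_advantages(rewards):
    """Group-relative advantage: each reward minus the group mean."""
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]

# Context-inconsistent grouping: easy and hard steps share one group.
mixed = group_advantages(easy + hard)
# Context-consistent grouping: hard-context steps compared among themselves.
consistent = group_advantages(hard)

# In the mixed group, every hard-context step receives a negative advantage,
# including the best action available in that context; with a consistent
# group, the best hard-context step is correctly rewarded.
print(all(a < 0 for a in mixed[3:]))  # True
print(consistent[-1] > 0)             # True
```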
Hierarchy-of-Groups Policy Optimization (HGPO) algorithm
The authors introduce HGPO, a novel reinforcement learning algorithm that addresses context inconsistency through two key components: context-aware hierarchical grouping that assigns steps to multiple hierarchical groups based on historical context consistency, and adaptive weighting advantage estimation that aggregates group advantages with weights favoring more consistent contexts.
[43] Hybrid group relative policy optimization: A multi-sample approach to enhancing policy optimization
[44] Hierarchical Budget Policy Optimization for Adaptive Reasoning
[45] Hierarchical Reinforcement Learning via Advantage-Weighted Information Maximization
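The two HGPO components described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name, the representation of a step's historical context as a tuple of prefix elements, the use of prefix length as the hierarchy level, and the normalized level weights are all assumptions made here for clarity.

```python
import math
from collections import defaultdict

def hierarchical_advantages(steps, level_weights):
    """Sketch of hierarchy-of-groups advantage estimation.

    steps: list of (context, reward) pairs, where context is a tuple of
           history elements; a longer shared prefix means a more
           consistent historical context.
    level_weights: one weight per hierarchy level; later levels (longer
           shared prefixes) should get larger weights so that advantages
           from context-consistent groups dominate the aggregate.
    """
    advantages = [0.0] * len(steps)
    total_w = sum(level_weights)
    for h, w in enumerate(level_weights, start=1):
        # Context-aware hierarchical grouping: at level h, steps sharing
        # their first h context elements form one group.
        groups = defaultdict(list)
        for i, (ctx, _) in enumerate(steps):
            groups[ctx[:h]].append(i)
        # Group-relative advantage within each level-h group.
        for idxs in groups.values():
            rewards = [steps[i][1] for i in idxs]
            mean = sum(rewards) / len(rewards)
            std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards))
            for i in idxs:
                a = (steps[i][1] - mean) / (std + 1e-8)
                # Adaptive weighting: aggregate across levels, with
                # normalized weights favoring more consistent contexts.
                advantages[i] += (w / total_w) * a
    return advantages
```

For example, with `steps = [(("s0", "a1"), 1.0), (("s0", "a1"), 0.0), (("s0", "a2"), 0.5)]` and `level_weights = [1.0, 2.0]`, the higher-reward step in the fully consistent group `("s0", "a1")` receives a positive aggregated advantage and its lower-reward partner a symmetric negative one, while the singleton group contributes zero.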
State-of-the-art empirical performance on agentic benchmarks
The authors demonstrate that HGPO outperforms existing methods on the ALFWorld and WebShop benchmarks with Qwen2.5 models, while matching the baselines' GPU memory usage and number of LLM rollouts and incurring only minimal additional time cost.