Direct Preference Optimization for Primitive-Enabled Hierarchical RL: A Bilevel Approach

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: hierarchical reinforcement learning, preference-based learning
Abstract:

Hierarchical reinforcement learning (HRL) enables agents to solve complex, long-horizon tasks by decomposing them into manageable sub-tasks. However, HRL methods face two fundamental challenges: (i) non-stationarity caused by the evolving lower-level policy during training, which destabilizes higher-level learning, and (ii) the generation of infeasible subgoals that lower-level policies cannot achieve. To address these challenges, we introduce DIPPER, a novel HRL framework that formulates goal-conditioned HRL as a bi-level optimization problem and leverages direct preference optimization (DPO) to train the higher-level policy. By learning from preference comparisons over subgoal sequences rather than rewards that depend on the evolving lower-level policy, DIPPER mitigates the impact of non-stationarity on higher-level learning. To address infeasible subgoals, DIPPER incorporates lower-level value function regularization that encourages the higher-level policy to propose achievable subgoals. We introduce two novel metrics to quantitatively verify that DIPPER mitigates non-stationarity and infeasible subgoal generation issues in HRL. Empirical evaluation on challenging robotic navigation and manipulation benchmarks shows that DIPPER achieves up to 40% improvements over state-of-the-art baselines in challenging sparse-reward scenarios, highlighting the potential of preference-based learning for addressing longstanding HRL limitations.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces DIPPER, a hierarchical RL framework that applies direct preference optimization to address non-stationarity and infeasible subgoal generation in goal-conditioned HRL. It resides in the 'Direct Preference Optimization for Hierarchical RL' leaf, which contains four papers total. This is a relatively sparse research direction within the broader taxonomy of 43 papers across the field, suggesting that applying DPO specifically to hierarchical RL remains an emerging area with limited prior exploration.

The taxonomy reveals that DIPPER sits within the 'Preference-Based Reward Learning and Modeling' branch, which also includes sibling leaves on reward model training, multi-level feedback integration, and active preference elicitation. Neighboring branches address hierarchical policy decomposition (subgoal generation, primitives, multi-objective optimization) and application domains (LLMs, robotics, autonomous systems). The scope notes clarify that this leaf focuses on bi-level optimization and preference comparisons, distinguishing it from methods that use pre-defined rewards or non-hierarchical DPO approaches.

Among 30 candidates examined, none clearly refute the three core contributions: the bi-level optimization framework (10 candidates, 0 refutable), the DIPPER framework using DPO (10 candidates, 0 refutable), and the novel metrics for non-stationarity and infeasible subgoals (10 candidates, 0 refutable). This suggests that within the limited search scope, the specific combination of DPO for hierarchical RL with value function regularization and quantitative metrics appears relatively unexplored. However, the search scale is modest, and the analysis does not claim exhaustive coverage of all relevant prior work.

Based on the limited literature search of 30 top-K semantic matches, the work appears to occupy a distinct position within a sparse research direction. The taxonomy context indicates that while hierarchical RL and preference-based methods are both active areas, their intersection—especially using DPO—remains less crowded. The analysis cannot rule out relevant work outside the examined candidates.

Taxonomy

Core-task Taxonomy Papers: 43
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 0

Research Landscape Overview

Core task: Hierarchical reinforcement learning with preference-based optimization. This field sits at the intersection of two major research directions: learning hierarchical policies that decompose complex tasks into manageable subtasks, and aligning agent behavior with human preferences rather than hand-crafted reward functions. The taxonomy reflects this dual focus through four main branches. Preference-Based Reward Learning and Modeling encompasses methods that learn reward signals from human feedback, including direct preference optimization techniques like DPO Hierarchical RL[0] and approaches that model preferences at multiple levels of abstraction such as Hierarchical Preference Learning[3]. Hierarchical Policy Learning and Decomposition addresses the structural challenge of breaking down tasks into goal-conditioned subpolicies and temporal abstractions. Application Domains and Task-Specific Methods captures diverse instantiations ranging from robotics and navigation to language model fine-tuning, while Theoretical Foundations and Survey Methods provides the conceptual underpinnings and broader perspectives on the field.

Recent work has explored how preference feedback can be integrated at different levels of a hierarchical policy, creating a rich design space with distinct trade-offs. Some approaches apply preference optimization directly to primitive-level actions, as in DPO Primitive Hierarchical[19], while others like DIPPER[35] and DIPPER Hierarchical[43] investigate how preferences can guide both high-level goal selection and low-level execution. DPO Hierarchical RL[0] falls within this latter category, emphasizing direct preference optimization across hierarchical structures. This contrasts with earlier methods such as Hierarchical Preference Learning[3], which focused on modeling preferences at multiple abstraction levels but used different optimization frameworks.
A central open question is whether preference signals are most effective when applied uniformly across all hierarchy levels or when tailored to specific layers, and how such choices affect sample efficiency and alignment quality in complex domains.

Claimed Contributions

Bi-level optimization framework for goal-conditioned HRL

The authors formulate hierarchical reinforcement learning as a bi-level optimization problem where the higher-level policy optimization constitutes the upper-level problem and the lower-level policy optimization forms the lower-level problem. This unified framework enables joint optimization of both policies while explicitly modeling their inter-dependencies.

10 retrieved papers
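The bi-level structure described above can be written schematically as follows. This is our own notation, not taken from the paper: π^H_θ denotes the higher-level (subgoal-proposing) policy, π^L_φ the subgoal-conditioned lower-level policy, and J^H, J^L their respective objectives.

```latex
\begin{aligned}
\max_{\theta}\;& J^{H}\!\bigl(\pi^{H}_{\theta},\; \pi^{L}_{\phi^{*}(\theta)}\bigr)
  && \text{(upper level: higher-level policy)}\\
\text{s.t.}\;& \phi^{*}(\theta) \in \arg\max_{\phi}\;
  J^{L}\!\bigl(\pi^{L}_{\phi} \mid \pi^{H}_{\theta}\bigr)
  && \text{(lower level: subgoal-conditioned policy)}
\end{aligned}
```

The coupling through φ*(θ) is what makes the problem bi-level: the higher level is evaluated under the lower-level policy that is itself a best response to the higher level's subgoals.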
DIPPER framework using DPO to mitigate non-stationarity and infeasible subgoals

DIPPER trains the higher-level policy using direct preference optimization on stationary preference datasets, decoupling higher-level learning from the non-stationary lower-level reward signal. It incorporates lower-level value function regularization to ensure the higher-level policy generates only feasible subgoals.

10 retrieved papers
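The core mechanism here is the standard DPO objective applied to pairs of subgoal sequences: the higher-level policy is pushed to assign higher likelihood to the preferred sequence relative to a frozen reference policy. The following is a minimal sketch in our own notation (function name, arguments, and the default β are our assumptions, not the authors' code); the inputs are sequence log-probabilities under the current and reference higher-level policies.

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss for one preference pair of subgoal sequences (sketch).

    logp_w / logp_l         : log-prob of the preferred / dispreferred
                              sequence under the current higher-level policy
    ref_logp_w / ref_logp_l : the same quantities under a frozen reference
                              policy
    beta                    : strength of the implicit KL regularization
    """
    # implicit reward margin between preferred and dispreferred sequences
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log(sigmoid(margin)), written in a numerically stable form
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

Because the preference dataset is fixed, this objective does not depend on the evolving lower-level reward signal, which is the stationarity argument made above.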
Novel metrics for quantifying non-stationarity and infeasible subgoal generation

The authors introduce two metrics: the subgoal distance metric (measuring average distance between predicted and achieved subgoals) and the lower Q-function metric (measuring lower-level Q-values for predicted subgoals) to quantitatively assess non-stationarity and feasibility of subgoals in hierarchical RL.

10 retrieved papers
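The two metrics above admit a straightforward computation. The sketch below is our own illustrative implementation (names and signatures are hypothetical, not from the paper): the first metric averages the Euclidean distance between proposed and achieved subgoals, and the second averages the lower-level critic's Q-value over the proposed subgoals.

```python
import math

def subgoal_distance_metric(predicted, achieved):
    """Average Euclidean distance between subgoals proposed by the higher
    level and the states the lower level actually reaches; lower values
    suggest the two levels are better aligned (less non-stationarity)."""
    dists = [math.dist(p, a) for p, a in zip(predicted, achieved)]
    return sum(dists) / len(dists)

def lower_q_metric(q_fn, states, subgoals):
    """Mean lower-level Q-value of the proposed subgoals under the current
    lower-level critic; higher values suggest more feasible subgoals."""
    values = [q_fn(s, g) for s, g in zip(states, subgoals)]
    return sum(values) / len(values)
```

Both metrics can be logged per training batch, so their trends over training (distances shrinking, Q-values rising) serve as the quantitative evidence the contribution describes.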

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Bi-level optimization framework for goal-conditioned HRL

The authors formulate hierarchical reinforcement learning as a bi-level optimization problem where the higher-level policy optimization constitutes the upper-level problem and the lower-level policy optimization forms the lower-level problem. This unified framework enables joint optimization of both policies while explicitly modeling their inter-dependencies.

Contribution

DIPPER framework using DPO to mitigate non-stationarity and infeasible subgoals

DIPPER trains the higher-level policy using direct preference optimization on stationary preference datasets, decoupling higher-level learning from the non-stationary lower-level reward signal. It incorporates lower-level value function regularization to ensure the higher-level policy generates only feasible subgoals.

Contribution

Novel metrics for quantifying non-stationarity and infeasible subgoal generation

The authors introduce two metrics: the subgoal distance metric (measuring average distance between predicted and achieved subgoals) and the lower Q-function metric (measuring lower-level Q-values for predicted subgoals) to quantitatively assess non-stationarity and feasibility of subgoals in hierarchical RL.