Policy Likelihood-based Query Sampling and Critic-Exploited Reset for Efficient Preference-based Reinforcement Learning

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: preference-based reinforcement learning, robotic manipulation, locomotion
Abstract:

Preference-based reinforcement learning (PbRL) enables agent training without explicit reward design by leveraging human feedback. Although various query sampling strategies have been proposed to improve feedback efficiency, many fail to enhance performance because they select queries from outdated experiences with low likelihood under the current policy. Such queries may no longer represent the agent's evolving behavior patterns, reducing the informativeness of human feedback. To address this issue, we propose policy likelihood-based query sampling with critic-exploited reset (PoLiCER). Our approach uses policy likelihood-based query sampling to ensure that queries remain aligned with the agent's evolving behavior. However, relying solely on policy-aligned sampling can result in overly localized guidance, leading to overestimation bias, as the model tends to overfit to early feedback experiences. To mitigate this, PoLiCER incorporates a dynamic resetting mechanism that selectively resets the reward estimator and its associated Q-function based on critic outputs. Experimental evaluation across diverse locomotion and robotic manipulation tasks demonstrates that PoLiCER consistently outperforms existing PbRL methods.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes PoLiCER, combining policy likelihood-based query sampling with a critic-exploited reset mechanism to improve feedback efficiency in preference-based reinforcement learning. It resides in the 'Active Query Selection and Sampling' leaf, which contains four papers total including this one. This leaf sits within the broader 'Feedback Collection and Query Strategies' branch, indicating a moderately populated research direction focused on minimizing human annotation effort. The taxonomy reveals this is an active area with multiple competing approaches to query selection, suggesting the problem space is well-explored but not saturated.

The paper's immediate neighbors include uncertainty-driven methods (Reward Uncertainty Exploration) and exploration-focused strategies (Efficient Online Exploration), both addressing query efficiency through different principles. The sibling 'Label Smoothing and Experience Alignment' leaf contains three papers tackling related overfitting issues through regularization rather than dynamic resetting. The broader 'Reward Model Learning and Estimation' branch (nine papers across four leaves) addresses complementary challenges in handling noisy preferences and model robustness, while 'Policy Optimization and Training Algorithms' (nine papers) focuses on downstream policy updates. PoLiCER bridges query selection with policy training dynamics, positioning it at the intersection of these branches.

Among the eleven candidates examined in total, the policy likelihood-based sampling contribution shows overlap with prior work: two of the nine candidates reviewed for this contribution appear to provide refuting evidence. For the critic-exploited reset mechanism, two candidates were examined with no clear refutations, suggesting greater novelty in this component. The combined PoLiCER framework was not directly compared against any candidates. The limited search scope (eleven papers, not exhaustive) means these statistics reflect top-K semantic matches rather than comprehensive field coverage. The reset mechanism addressing overestimation bias appears less explored in the examined literature than policy-aligned sampling strategies.

Based on the limited search of eleven semantically similar papers, the work appears to offer incremental contributions in query sampling with potentially stronger novelty in the dynamic reset mechanism. The taxonomy context shows this sits in a moderately active research area with established competing approaches. The analysis cannot definitively assess novelty beyond the examined candidates, and a broader literature review would be needed to confirm whether the critic-exploited reset represents a substantive departure from existing overfitting mitigation strategies in preference-based RL.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 11
Refutable Papers: 2

Research Landscape Overview

Core task: efficient preference-based reinforcement learning with human feedback. The field has matured into several interconnected branches that address complementary challenges in learning from human preferences. Reward Model Learning and Estimation focuses on building accurate representations of human values from comparison data, often grappling with noise and distributional robustness (e.g., RIME Robust Preferences[4], Minimaximalist RLHF[3]). Policy Optimization and Training Algorithms develops scalable methods for aligning agent behavior with learned reward models, including parameter-efficient approaches (Parameter Efficient RLHF[10]) and online iterative schemes (Online Iterative RLHF[27]). Feedback Collection and Query Strategies tackles the sample efficiency bottleneck by intelligently selecting which queries to pose to human annotators, encompassing active learning and exploration-driven methods (Reward Uncertainty Exploration[41], Efficient Online Exploration[34]). Personalization and Multi-Objective Alignment extends the framework to handle diverse user preferences and safety constraints (Safe RLHF[1], Multi-objective Preference RL[24]), while Foundations and Survey Literature provides theoretical grounding (RLHF Survey[8], Preference-based RL Review[15]) and Domain-Specific Applications demonstrates practical deployment across robotics, language models, and autonomous systems. A particularly active research direction centers on reducing the number of human queries required for effective learning, balancing exploration with exploitation under uncertainty. Policy Likelihood Query Sampling[0] sits squarely within this active query selection cluster, proposing to prioritize queries based on policy likelihood to maximize information gain. 
This approach contrasts with uncertainty-driven methods like Reward Uncertainty Exploration[41], which explicitly model epistemic uncertainty in the reward function, and with exploration-focused strategies such as Efficient Online Exploration[34], which emphasize discovering informative state-action regions. While Sequential Preference Ranking[14] addresses query efficiency through structured ranking formats, Policy Likelihood Query Sampling[0] leverages the policy's own trajectory distribution to guide sampling. The central trade-off across these works involves computational overhead versus sample efficiency: more sophisticated query selection can dramatically reduce annotation burden but may introduce additional modeling complexity or assumptions about the underlying preference structure.

Claimed Contributions

Policy likelihood-based query sampling (PLS)

A query sampling strategy that selects trajectory pairs for human feedback based on their likelihood under the current policy rather than temporal recency. This ensures queries remain relevant to the agent's evolving behavior throughout training, addressing query-policy misalignment in preference-based reinforcement learning.

Retrieved papers: 9 (can refute)
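The sampling rule described above can be sketched in code. This is a minimal illustration under stated assumptions: the function names, the use of mean action log-likelihood as the segment score, and top-k selection over candidate pairs are illustrative choices, not the paper's exact formulation.

```python
import numpy as np

def segment_log_likelihood(segment, policy_log_prob):
    """Mean log-likelihood of a segment's actions under the current policy.

    `segment` is a sequence of (state, action) pairs and `policy_log_prob(s, a)`
    returns log pi(a | s); both names are hypothetical, not the paper's API.
    """
    return float(np.mean([policy_log_prob(s, a) for s, a in segment]))

def sample_queries(segment_pairs, policy_log_prob, n_queries):
    """Rank candidate trajectory pairs by their combined likelihood under the
    current policy and keep the top ones, instead of sampling by recency."""
    scores = [
        segment_log_likelihood(seg_a, policy_log_prob)
        + segment_log_likelihood(seg_b, policy_log_prob)
        for seg_a, seg_b in segment_pairs
    ]
    order = np.argsort(scores)[::-1]  # most policy-aligned pairs first
    return [segment_pairs[i] for i in order[:n_queries]]
```

Under this sketch, stale pairs drawn from early experience receive low scores once the policy has moved on, so the queries posed to the annotator track the agent's current behavior distribution.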
Critic-exploited reset (CER)

A dynamic resetting mechanism that monitors critic outputs to detect reward overestimation and strategically resets both the reward estimator and Q-function when necessary. This approach mitigates primacy bias while maintaining computational efficiency through adaptive thresholding.

Retrieved papers: 2
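The reset mechanism described above can be sketched as follows. This is a hedged illustration, not the paper's method: the class name, the running empirical-return threshold, and the `ratio` multiplier are all assumptions standing in for whatever adaptive thresholding the paper actually uses.

```python
import numpy as np

class CriticExploitedReset:
    """Sketch of a critic-triggered reset: if the critic's mean Q-value drifts
    far above recent empirical returns, treat it as overestimation and reset
    the reward estimator and Q-function via a supplied callback."""

    def __init__(self, reset_fn, ratio=2.0, window=100):
        self.reset_fn = reset_fn  # re-initializes reward model and Q-function
        self.ratio = ratio        # how far Q-values may exceed observed returns
        self.window = window      # number of recent returns kept for the threshold
        self.returns = []

    def observe(self, empirical_return, mean_q):
        """Record an observed return and the critic's mean Q-value on a batch;
        trigger the reset and return True when the adaptive threshold is exceeded."""
        self.returns = (self.returns + [empirical_return])[-self.window:]
        threshold = self.ratio * (np.mean(np.abs(self.returns)) + 1e-8)
        if mean_q > threshold:
            self.reset_fn()
            self.returns = []  # start fresh after the reset
            return True
        return False
```

Because the threshold tracks recent returns rather than a fixed constant, the check stays cheap (one comparison per update) while adapting to the scale of the task's rewards, which is consistent with the claim of computational efficiency through adaptive thresholding.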
PoLiCER framework combining PLS and CER

An integrated framework that combines policy likelihood-based query sampling with critic-exploited reset to address both query-policy misalignment and primacy bias in preference-based reinforcement learning. The two components work synergistically to improve feedback efficiency and learning stability.

Retrieved papers: 0

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution
