Online Rubrics Elicitation from Pairwise Comparisons

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: rubrics, checklists, post-training, reward hacking, reinforcement learning
Abstract:

Rubrics provide a flexible way to train LLMs on open-ended long-form answers where verifiable rewards are not applicable and human preferences provide only coarse signals. Prior work shows that reinforcement learning with rubric-based rewards leads to consistent gains in LLM post-training. Most existing approaches, however, rely on rubrics that remain static over the course of training. Such static rubrics are vulnerable to reward-hacking behaviors and fail to capture emergent desiderata that arise during training. We introduce Online Rubrics Elicitation (OnlineRubrics), a method that dynamically curates evaluation criteria in an online manner through pairwise comparisons of responses from the current and reference policies. This online process enables continuous identification and mitigation of errors as training proceeds. Empirically, this approach yields consistent improvements of up to 8% over training exclusively with static rubrics across AlpacaEval, GPQA, and ArenaHard, as well as on the validation sets of expert questions and rubrics. We qualitatively analyze the elicited criteria and identify prominent themes such as transparency, practicality, organization, and reasoning.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces OnlineRubrics, a method for dynamically eliciting evaluation criteria during reinforcement learning by analyzing pairwise comparisons between current and reference policy outputs. It resides in the 'Online Rubric and Criteria Elicitation' leaf of the taxonomy, which contains only two papers total. This places the work in a relatively sparse research direction within the broader field of dynamic and adaptive reward modeling, suggesting that online rubric generation during training remains an underexplored area compared to static reward modeling approaches.

The taxonomy reveals that the paper sits within 'Dynamic and Adaptive Reward Modeling,' which contrasts with neighboring branches like 'Pairwise Preference-Based Reward Modeling' (static Bradley-Terry models) and 'Generative and Rule-Based Reward Modeling' (fixed rule extraction). The sibling paper in the same leaf, OpenRubrics, also targets rubric generation but differs in its emphasis on temporal dynamics. Adjacent leaves include 'Adaptive Preference Learning and Model Refinement' (iterative reward updates) and 'Static Rubric and Rule-Based Reward Generation' (fixed criteria extraction), highlighting the paper's focus on continuous, online elicitation rather than one-shot or static approaches.

Among the 23 candidates examined across the three contributions, none was found to clearly refute the proposed work. For the core OnlineRubrics method, 10 candidates were examined with zero refutable overlaps; for the dataset contribution, 3 candidates with no refutations; and for the formal gradient-variance motivation, 10 candidates with no refutations. Within this limited search scope (the top semantic matches and citation expansions analyzed), no prior work directly anticipates the combination of online rubric elicitation with pairwise-comparison-driven criteria discovery during RL training.

Based on the limited literature search of 23 candidates, the work appears to occupy a novel position at the intersection of dynamic reward modeling and interpretable criteria generation. The sparse taxonomy leaf and absence of refutable prior work within the examined scope indicate potential originality, though a broader search beyond top-K semantic matches might reveal additional related efforts in adaptive evaluation or online preference learning.

Taxonomy

Core-task taxonomy papers: 30
Claimed contributions: 3
Contribution candidate papers compared: 23
Refutable papers: 0

Research Landscape Overview

Core task: dynamic rubric elicitation from pairwise comparisons during reinforcement learning. The field has organized itself around several complementary directions:

- Pairwise Preference-Based Reward Modeling focuses on extracting reward signals directly from human or automated comparisons, often using Bradley-Terry models or ranking-based objectives (e.g., Helpful Harmless Assistant[2], AlpacaFarm[26]).
- Dynamic and Adaptive Reward Modeling emphasizes methods that update or refine reward structures online, including approaches that elicit criteria or rubrics interactively (e.g., OpenRubrics[11], Valid Feedback RL[3]).
- Generative and Rule-Based Reward Modeling explores using language models to produce interpretable reward functions or rules (e.g., Generative Reward Modeling[4], AutoRule[9]).
- Feedback-Driven Optimization and Interaction investigates iterative loops where agent behavior and human feedback co-evolve (e.g., Feedback Descent[14], LLaMA-Berry[6]).
- Vision-Language and Multimodal Reinforcement Learning extends preference learning to settings with visual or cross-modal inputs (e.g., RL-VLM-F[10]).
- Specialized Domain Applications of Preference-Based RL applies these techniques to robotics, aircraft handling, and other domains (e.g., Aircraft Handling RLHF[5]).

A particularly active line of work centers on making reward modeling more transparent and adaptive. Generative approaches like Generative Reward Modeling[4] and AutoRule[9] aim to produce human-readable criteria, while online elicitation methods such as OpenRubrics[11] and Valid Feedback RL[3] refine rubrics during training. Online Rubrics Elicitation[0] sits squarely within this dynamic and adaptive branch, emphasizing the interactive discovery of evaluation criteria from pairwise comparisons. Compared to OpenRubrics[11], which also targets rubric generation, Online Rubrics Elicitation[0] focuses more explicitly on the temporal dynamics of elicitation during the learning loop. Meanwhile, Valid Feedback RL[3] shares the goal of incorporating evolving human input but does not necessarily structure feedback as explicit rubrics. These contrasts highlight an open question: how to balance interpretability, adaptability, and sample efficiency when the criteria themselves must be learned alongside the policies.
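To make the contrast with the static pairwise branch concrete: a Bradley-Terry reward model scores a pair of responses through a fixed scalar reward, with the preference probability depending only on the reward gap. The sketch below is illustrative only (the function name is ours, not from any cited paper):

```python
import math

def bradley_terry_prob(reward_a: float, reward_b: float) -> float:
    """Probability that response A is preferred over response B under a
    Bradley-Terry model: a sigmoid of the scalar reward difference."""
    return 1.0 / (1.0 + math.exp(reward_b - reward_a))
```

Because such a model reduces every comparison to one scalar gap, it never revises *what* it measures; that fixed notion of quality is precisely what the dynamic and adaptive branch, including rubric elicitation, tries to relax.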

Claimed Contributions

Online Rubrics Elicitation (OnlineRubrics) method

A framework that dynamically elicits new evaluation criteria during reinforcement learning by comparing responses from the current policy and a control policy. This enables continuous identification and mitigation of errors as training proceeds, addressing limitations of static rubrics that are vulnerable to reward-hacking and fail to capture emergent desiderata.

10 retrieved papers
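The described method can be pictured as a loop that interleaves rubric growth with policy updates. The sketch below is a hedged illustration of that loop, not the authors' implementation; every callable name (sample_response, elicit_criteria, rubric_reward, policy_update) is a hypothetical placeholder:

```python
# Hedged sketch of the online rubric-elicitation loop described above.
# All function names are illustrative placeholders, not the paper's API.

def online_rubrics_step(prompt, policy, reference_policy, rubric,
                        sample_response, elicit_criteria,
                        rubric_reward, policy_update):
    """One illustrative training step: compare a current-policy response
    against a reference-policy response, add any new criteria the
    comparison surfaces, then perform a rubric-rewarded policy update."""
    y_cur = sample_response(policy, prompt)
    y_ref = sample_response(reference_policy, prompt)

    # Pairwise comparison: a judge (e.g., an LLM) inspects both responses
    # and proposes criteria that distinguish them, such as emergent
    # reward-hacking behaviors the static rubric missed.
    for criterion in elicit_criteria(prompt, y_cur, y_ref):
        if criterion not in rubric:
            rubric.append(criterion)

    # Score the current response against the augmented rubric and update.
    reward = rubric_reward(prompt, y_cur, rubric)
    policy = policy_update(policy, prompt, y_cur, reward)
    return policy, rubric
```

The key design point, as the contribution statement frames it, is that elicitation happens inside the training loop, so the rubric can track errors of the *current* policy rather than only the failure modes anticipated before training.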
Two curated datasets for expert and generalist domains

Two rubric-based datasets (Generalist Rubrics and Expert Rubrics) containing prompts with human-authored, prompt-specific rubrics composed of weighted, binary-checkable criteria. These datasets enable training and evaluation of rubric-based reinforcement learning methods across different domains.

3 retrieved papers
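"Weighted, binary-checkable criteria" suggests a reward of the following shape: each criterion either passes or fails, and the score is the weight-normalized sum of passed criteria. This is a minimal sketch under that assumption (the function and the judge interface are ours, not the datasets' actual schema):

```python
def rubric_score(response: str, rubric: list[tuple[str, float]], check) -> float:
    """Weighted, binary-checkable rubric scoring: `rubric` is a list of
    (criterion_text, weight) pairs, and `check` stands in for whatever
    judge (e.g., an LLM grader) decides whether `response` satisfies a
    given criterion. Returns the weight-normalized sum of passed criteria."""
    total = sum(weight for _, weight in rubric)
    if total == 0.0:
        return 0.0
    earned = sum(weight for criterion, weight in rubric
                 if check(response, criterion))
    return earned / total
```

For example, with the rubric [("states the final answer", 2.0), ("cites a source", 1.0)] and only the first criterion met, the score is 2/3.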
Formal motivation showing gradient variance reduction

A theoretical result (Proposition 1) demonstrating that augmenting rubrics to better approximate the true criterion set reduces the variance term in policy gradient updates, leading to improved stability and sample efficiency during training by tightening the upper bound on unmodeled criteria mass.

10 retrieved papers
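The report summarizes Proposition 1 without stating it. Under assumed notation (all symbols below are illustrative, not the paper's), a result of the described shape looks like this:

```latex
% Illustrative notation only; the paper's actual statement may differ.
% Write the true reward as a weighted sum of binary criteria over the
% full criterion set $C^*$, and the rubric reward over a subset $C$:
r^*(x,y) = \sum_{c \in C^*} w_c\, \phi_c(x,y), \qquad
\hat{r}(x,y) = \sum_{c \in C} w_c\, \phi_c(x,y), \quad C \subseteq C^* .
% The policy gradient estimated from $\hat{r}$ then carries an error and
% variance term controlled by the unmodeled criteria mass
\varepsilon(C) = \sum_{c \in C^* \setminus C} w_c ,
% so augmenting the rubric, $C \subseteq C'$, can only tighten the bound:
\varepsilon(C') \le \varepsilon(C) .
```

Read this way, online elicitation is a mechanism for shrinking the unmodeled mass over the course of training, which is how the contribution connects rubric augmentation to gradient stability and sample efficiency.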

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution 1: Online Rubrics Elicitation (OnlineRubrics) method
Contribution 2: Two curated datasets for expert and generalist domains
Contribution 3: Formal motivation showing gradient variance reduction