Online Rubrics Elicitation from Pairwise Comparisons

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: rubrics, checklists, post-training, reward hacking, reinforcement learning
Abstract:

Rubrics provide a flexible way to train LLMs on open-ended long-form answers where verifiable rewards are not applicable and human preferences provide only coarse signals. Prior work shows that reinforcement learning with rubric-based rewards leads to consistent gains in LLM post-training. Most existing approaches, however, rely on rubrics that remain static over the course of training. Such static rubrics are vulnerable to reward-hacking behaviors and fail to capture emergent desiderata that arise during training. We introduce Online Rubrics Elicitation (OnlineRubrics), a method that dynamically curates evaluation criteria in an online manner through pairwise comparisons of responses from the current and reference policies. This online process enables continuous identification and mitigation of errors as training proceeds. Empirically, this approach yields consistent improvements of up to 8% over training exclusively with static rubrics across AlpacaEval, GPQA, and ArenaHard, as well as on the validation sets of expert questions and rubrics. We qualitatively analyze the elicited criteria and identify prominent themes such as transparency, practicality, organization, and reasoning.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces OnlineRubrics, a method for dynamically eliciting evaluation criteria during reinforcement learning by analyzing pairwise comparisons between current and reference policy outputs. It resides in the 'Online Rubric and Criteria Elicitation' leaf of the taxonomy, which contains only two papers total. This places the work in a relatively sparse research direction within the broader field of dynamic and adaptive reward modeling, suggesting that online rubric generation during training remains an underexplored area compared to static reward modeling approaches.

The taxonomy reveals that the paper sits within 'Dynamic and Adaptive Reward Modeling,' which contrasts with neighboring branches like 'Pairwise Preference-Based Reward Modeling' (static Bradley-Terry models) and 'Generative and Rule-Based Reward Modeling' (fixed rule extraction). The sibling paper in the same leaf, OpenRubrics, also targets rubric generation but differs in its emphasis on temporal dynamics. Adjacent leaves include 'Adaptive Preference Learning and Model Refinement' (iterative reward updates) and 'Static Rubric and Rule-Based Reward Generation' (fixed criteria extraction), highlighting the paper's focus on continuous, online elicitation rather than one-shot or static approaches.

Among the 23 candidates examined across the three contributions, none was found to clearly refute the proposed work. For the core OnlineRubrics method, 10 candidates were examined with zero refutable overlaps; for the dataset contribution, 3 candidates with no refutations; and for the formal gradient-variance motivation, 10 candidates with no refutations. Within this limited search scope (the top semantic matches and citation expansions analyzed), no prior work directly anticipates the combination of online rubric elicitation with pairwise-comparison-driven criteria discovery during RL training.

Based on the limited literature search of 23 candidates, the work appears to occupy a novel position at the intersection of dynamic reward modeling and interpretable criteria generation. The sparse taxonomy leaf and absence of refutable prior work within the examined scope indicate potential originality, though a broader search beyond top-K semantic matches might reveal additional related efforts in adaptive evaluation or online preference learning.

Taxonomy

Core-task taxonomy papers: 30
Claimed contributions: 3
Contribution candidate papers compared: 23
Refutable papers: 0

Research Landscape Overview

Core task: dynamic rubric elicitation from pairwise comparisons during reinforcement learning. The field has organized itself around several complementary directions:

- Pairwise Preference-Based Reward Modeling focuses on extracting reward signals directly from human or automated comparisons, often using Bradley-Terry models or ranking-based objectives (e.g., Helpful Harmless Assistant[2], AlpacaFarm[26]).
- Dynamic and Adaptive Reward Modeling emphasizes methods that update or refine reward structures online, including approaches that elicit criteria or rubrics interactively (e.g., OpenRubrics[11], Valid Feedback RL[3]).
- Generative and Rule-Based Reward Modeling explores using language models to produce interpretable reward functions or rules (e.g., Generative Reward Modeling[4], AutoRule[9]).
- Feedback-Driven Optimization and Interaction investigates iterative loops where agent behavior and human feedback co-evolve (e.g., Feedback Descent[14], LLaMA-Berry[6]).
- Vision-Language and Multimodal Reinforcement Learning extends preference learning to settings with visual or cross-modal inputs (e.g., RL-VLM-F[10]).
- Specialized Domain Applications of Preference-Based RL applies these techniques to robotics, aircraft handling, and other domains (e.g., Aircraft Handling RLHF[5]).

A particularly active line of work centers on making reward modeling more transparent and adaptive. Generative approaches like Generative Reward Modeling[4] and AutoRule[9] aim to produce human-readable criteria, while online elicitation methods such as OpenRubrics[11] and Valid Feedback RL[3] refine rubrics during training. Online Rubrics Elicitation[0] sits squarely within this dynamic and adaptive branch, emphasizing the interactive discovery of evaluation criteria from pairwise comparisons. Compared to OpenRubrics[11], which also targets rubric generation, Online Rubrics Elicitation[0] focuses more explicitly on the temporal dynamics of elicitation during the learning loop. Meanwhile, Valid Feedback RL[3] shares the goal of incorporating evolving human input but does not necessarily structure feedback as explicit rubrics. These contrasts highlight an open question: how to balance interpretability, adaptability, and sample efficiency when the criteria themselves must be learned alongside the policies.
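To make the contrast with the static pairwise branch concrete: a Bradley-Terry reward model scores a pair of responses through a fixed scalar reward, with the preference probability depending only on the reward gap. The sketch below is illustrative only (the function name is ours, not from any cited paper):

```python
import math

def bradley_terry_prob(reward_a: float, reward_b: float) -> float:
    """Probability that response A is preferred over response B under a
    Bradley-Terry model: a sigmoid of the scalar reward difference."""
    return 1.0 / (1.0 + math.exp(reward_b - reward_a))
```

Because such a model reduces every comparison to one scalar gap, it never revises *what* it measures; that fixed notion of quality is precisely what the dynamic and adaptive branch, including rubric elicitation, tries to relax.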

Claimed Contributions

Online Rubrics Elicitation (OnlineRubrics) method

A framework that dynamically elicits new evaluation criteria during reinforcement learning by comparing responses from the current policy and a control policy. This enables continuous identification and mitigation of errors as training proceeds, addressing limitations of static rubrics that are vulnerable to reward-hacking and fail to capture emergent desiderata.

10 retrieved papers
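The described method can be pictured as a loop that interleaves rubric growth with policy updates. The sketch below is a hedged illustration of that loop, not the authors' implementation; every callable name (sample_response, elicit_criteria, rubric_reward, policy_update) is a hypothetical placeholder:

```python
# Hedged sketch of the online rubric-elicitation loop described above.
# All function names are illustrative placeholders, not the paper's API.

def online_rubrics_step(prompt, policy, reference_policy, rubric,
                        sample_response, elicit_criteria,
                        rubric_reward, policy_update):
    """One illustrative training step: compare a current-policy response
    against a reference-policy response, add any new criteria the
    comparison surfaces, then perform a rubric-rewarded policy update."""
    y_cur = sample_response(policy, prompt)
    y_ref = sample_response(reference_policy, prompt)

    # Pairwise comparison: a judge (e.g., an LLM) inspects both responses
    # and proposes criteria that distinguish them, such as emergent
    # reward-hacking behaviors the static rubric missed.
    for criterion in elicit_criteria(prompt, y_cur, y_ref):
        if criterion not in rubric:
            rubric.append(criterion)

    # Score the current response against the augmented rubric and update.
    reward = rubric_reward(prompt, y_cur, rubric)
    policy = policy_update(policy, prompt, y_cur, reward)
    return policy, rubric
```

The key design point, as the contribution statement frames it, is that elicitation happens inside the training loop, so the rubric can track errors of the *current* policy rather than only the failure modes anticipated before training.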
Two curated datasets for expert and generalist domains

Two rubric-based datasets (Generalist Rubrics and Expert Rubrics) containing prompts with human-authored, prompt-specific rubrics composed of weighted, binary-checkable criteria. These datasets enable training and evaluation of rubric-based reinforcement learning methods across different domains.

3 retrieved papers
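"Weighted, binary-checkable criteria" suggests a reward of the following shape: each criterion either passes or fails, and the score is the weight-normalized sum of passed criteria. This is a minimal sketch under that assumption (the function and the judge interface are ours, not the datasets' actual schema):

```python
def rubric_score(response: str, rubric: list[tuple[str, float]], check) -> float:
    """Weighted, binary-checkable rubric scoring: `rubric` is a list of
    (criterion_text, weight) pairs, and `check` stands in for whatever
    judge (e.g., an LLM grader) decides whether `response` satisfies a
    given criterion. Returns the weight-normalized sum of passed criteria."""
    total = sum(weight for _, weight in rubric)
    if total == 0.0:
        return 0.0
    earned = sum(weight for criterion, weight in rubric
                 if check(response, criterion))
    return earned / total
```

For example, with the rubric [("states the final answer", 2.0), ("cites a source", 1.0)] and only the first criterion met, the score is 2/3.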
Formal motivation showing gradient variance reduction

A theoretical result (Proposition 1) demonstrating that augmenting rubrics to better approximate the true criterion set reduces the variance term in policy gradient updates, leading to improved stability and sample efficiency during training by tightening the upper bound on unmodeled criteria mass.

10 retrieved papers
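The report summarizes Proposition 1 without stating it. Under assumed notation (all symbols below are illustrative, not the paper's), a result of the described shape looks like this:

```latex
% Illustrative notation only; the paper's actual statement may differ.
% Write the true reward as a weighted sum of binary criteria over the
% full criterion set $C^*$, and the rubric reward over a subset $C$:
r^*(x,y) = \sum_{c \in C^*} w_c\, \phi_c(x,y), \qquad
\hat{r}(x,y) = \sum_{c \in C} w_c\, \phi_c(x,y), \quad C \subseteq C^* .
% The policy gradient estimated from $\hat{r}$ then carries an error and
% variance term controlled by the unmodeled criteria mass
\varepsilon(C) = \sum_{c \in C^* \setminus C} w_c ,
% so augmenting the rubric, $C \subseteq C'$, can only tighten the bound:
\varepsilon(C') \le \varepsilon(C) .
```

Read this way, online elicitation is a mechanism for shrinking the unmodeled mass over the course of training, which is how the contribution connects rubric augmentation to gradient stability and sample efficiency.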

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution 1: Online Rubrics Elicitation (OnlineRubrics) method
Contribution 2: Two curated datasets for expert and generalist domains
Contribution 3: Formal motivation showing gradient variance reduction