J1: Incentivizing Thinking in LLM-as-a-Judge via Reinforcement Learning

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: LLM-as-a-Judge, Reasoning, Reinforcement Learning
Abstract:

The progress of AI is bottlenecked by the quality of evaluation, making powerful LLM-as-a-Judge models a core solution. The efficacy of these judges depends on their chain-of-thought reasoning, creating a critical need for methods that can effectively optimize this reasoning process. In this work, we introduce J1, a reinforcement learning framework for teaching LLM judges to think before making decisions. Our core contribution lies in converting all judgment tasks, for both non-verifiable and verifiable prompts, into a unified format with verifiable rewards, enabling direct optimization of evaluation quality while mitigating positional bias. We then use RL to train thinking-judges at scales of 8B, 32B, and 70B and show that they obtain state-of-the-art performance across multiple benchmarks. In particular, J1-Qwen-32B, our multitask pointwise-and-pairwise judge, also outperforms o1-mini, o3, and the much larger 671B DeepSeek-R1 on some benchmarks, while training only on synthetic data. Through comprehensive ablations of pairwise, pointwise, and multitask J1 variants, we demonstrate the effectiveness of our approach across seed prompts, reward strategies, and training recipes. Qualitative analysis reveals that J1 develops systematic evaluation strategies, including dynamic criteria generation, reference answer creation, iterative self-correction of initial assessments, and feedback generation for low-quality responses.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND ITS JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces J1, a reinforcement learning framework for training LLM judges to generate reasoning traces before making evaluation decisions. It resides in the Reasoning-Enhanced Judge Training leaf, which contains four papers total—a relatively sparse cluster within the broader taxonomy. This positioning suggests the work addresses a focused research direction: explicitly teaching judges to think through chain-of-thought reasoning rather than directly optimizing judgment outputs. The sibling papers in this leaf (Think J, Plan Reason Evaluation, and High Entropy Tokens) similarly explore reasoning scaffolds for evaluation, indicating a small but active community investigating how to incentivize deliberative judgment processes.

The taxonomy reveals that Reasoning-Enhanced Judge Training sits within the larger LLM Judge Training and Evaluation branch, which also includes Direct Judgment Optimization (preference learning without reasoning scaffolds) and Meta-Evaluation frameworks. Neighboring branches such as General Reasoning Enhancement via RL contain foundational methods like DeepSeek R1 and Learning Reason Search, which improve reasoning across diverse tasks but lack the evaluation-specific focus. The scope notes clarify that J1's emphasis on converting judgment tasks into verifiable formats distinguishes it from general-purpose reasoning methods, while its RL-driven approach separates it from direct preference optimization techniques that bypass explicit reasoning traces.

Across the three contributions analyzed, the literature search examined 26 candidate papers in total: 10 for the core J1 framework (zero refutable overlaps), 10 for the unified verifiable training format (no clear refutations), and 6 for the position-consistent pointwise training method (again, no refutations). These statistics suggest that within the limited search scope—top-K semantic matches plus citation expansion—no prior work was found that directly anticipates the specific combination of RL-driven reasoning incentives, unified verifiable rewards, and positional bias mitigation that J1 proposes. However, the small candidate pool means the analysis captures nearby work rather than exhaustive coverage.

Based on the limited search scope of 26 candidates, the work appears to occupy a distinct position within the sparse Reasoning-Enhanced Judge Training cluster. The absence of refutable overlaps across all three contributions suggests novelty relative to the examined literature, though the small candidate pool and focused taxonomy leaf indicate this assessment reflects local rather than field-wide context. The taxonomy structure shows that while general RL reasoning methods are well-populated, the specific intersection of judge training and reasoning scaffolds remains less explored.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 26
Refutable Papers: 0

Research Landscape Overview

Core task: Training LLM judges to reason before making evaluation decisions using reinforcement learning.

The field structure reflects a broad ecosystem organized around several complementary themes. At the top level, one branch focuses explicitly on LLM Judge Training and Evaluation, encompassing methods that teach models to produce explicit reasoning traces before rendering judgments. A second major branch, General Reasoning Enhancement via RL, addresses foundational techniques for improving step-by-step inference across diverse problem types—works like DeepSeek R1[3] and Learning Reason Search[1] exemplify efforts to scale reasoning through search and policy optimization. Domain-Specific RL Reasoning Applications and Cross-Domain and Multi-Task RL Reasoning branches capture specialized adaptations (e.g., mathematics, vision-language tasks, software engineering), while Auxiliary Components and Reward Modeling groups studies on verifiable rewards and meta-rewarding strategies. Additional branches cover efficiency optimizations, interactive and multi-agent settings, alternative training paradigms, and surveys that synthesize emerging trends.

Within this landscape, a particularly active line of work explores how to incentivize models to generate intermediate reasoning steps that improve final evaluation quality. Incentivizing Thinking Judge[0] sits squarely in the Reasoning-Enhanced Judge Training cluster, emphasizing RL-driven incentives for judges to think before deciding. This approach contrasts with nearby efforts such as Think J[36], which also targets judge reasoning but may differ in reward formulation or training dynamics, and Plan Reason Evaluation[6], which integrates planning modules into the evaluation pipeline. Meanwhile, works like High Entropy Tokens[5] and Test Time Scaling[40] investigate complementary angles—token-level uncertainty and inference-time compute trade-offs—that inform how reasoning traces are generated and validated.

The central tension across these studies revolves around balancing the cost of extended reasoning against gains in judgment accuracy, and determining which RL signals best align model introspection with human-like evaluative rigor.

Claimed Contributions

J1 reinforcement learning framework for training thinking LLM judges

The authors propose J1, a framework that uses reinforcement learning to train language models to perform chain-of-thought reasoning when evaluating responses. This enables judges to develop systematic evaluation strategies including criteria generation, reference answer creation, and iterative self-correction.

10 retrieved papers
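The verifiable-reward idea behind this contribution can be sketched in a few lines. The snippet below is a minimal illustration, not the paper's implementation: the `<think>` tag, the "Verdict: A/B" output format, and the function name are all assumptions made here for concreteness.

```python
import re

def pairwise_verdict_reward(completion: str, gold_label: str) -> float:
    """Verifiable reward for a pairwise thinking-judge (illustrative sketch).

    The judge is prompted to reason inside <think>...</think> tags and then
    emit a final verdict line such as "Verdict: A" or "Verdict: B". Only the
    final verdict is checked against the gold preference label, so the
    chain-of-thought itself is optimized indirectly through RL.
    """
    # Require a non-empty reasoning trace before the verdict (format check).
    if not re.search(r"<think>.+?</think>", completion, flags=re.DOTALL):
        return 0.0
    match = re.search(r"Verdict:\s*([AB])", completion)
    if match is None:
        return 0.0
    return 1.0 if match.group(1) == gold_label else 0.0
```

In an RL loop, this scalar would be the only training signal: the reasoning trace is never scored directly, which leaves the judge free to develop its own evaluation strategies, such as criteria generation or self-correction.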

Unified verifiable training format for judgment tasks

The authors develop a unified training recipe that transforms both verifiable tasks (like math problems) and non-verifiable subjective prompts into a format that can be optimized with reinforcement learning from verifiable rewards, enabling training of a single generalist judge across diverse domains using only synthetic data.

10 retrieved papers
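One way to picture the unified format is as a single constructor that maps both data sources onto the same labeled pairwise instance. The sketch below is an illustration under stated assumptions; the field names and the 50/50 order randomization are ours, not the paper's.

```python
import random
from dataclasses import dataclass

@dataclass
class JudgeInstance:
    prompt: str
    response_a: str
    response_b: str
    gold_label: str  # "A" or "B": which response is better

def make_instance(prompt: str, better: str, worse: str, rng=random) -> JudgeInstance:
    """Unify verifiable and non-verifiable data into one judgment format.

    For verifiable tasks (e.g. math), `better` is a correct solution and
    `worse` an incorrect one; for non-verifiable prompts, they are the
    chosen/rejected responses of a (possibly synthetic) preference pair.
    Either way the instance carries a verifiable gold label, so the same
    RL-from-verifiable-rewards recipe applies to both. Randomizing the
    order in which the two responses appear guards against positional
    shortcuts.
    """
    if rng.random() < 0.5:
        return JudgeInstance(prompt, better, worse, gold_label="A")
    return JudgeInstance(prompt, worse, better, gold_label="B")
```

Because every instance carries a gold label regardless of whether the underlying task was objectively verifiable, a single reward check applies uniformly across domains during RL training.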

Method for training position-consistent pointwise judges using pairwise supervision

The authors introduce a technique to train pointwise evaluation models that are inherently consistent across different positions by leveraging only pairwise supervision data. This approach addresses positional bias while enabling the development of multitask models capable of both pointwise and pairwise evaluations.

6 retrieved papers
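The position-consistency idea can be illustrated with a reward that scores each response in isolation and only compares the resulting scalars. The "Score: X" output format and both function names below are hypothetical, chosen here for illustration.

```python
import re
from typing import Optional

def extract_score(completion: str) -> Optional[float]:
    """Parse a final 'Score: X' value from a pointwise judge's output."""
    match = re.search(r"Score:\s*([0-9]+(?:\.[0-9]+)?)", completion)
    return float(match.group(1)) if match else None

def pointwise_pair_reward(completion_chosen: str, completion_rejected: str) -> float:
    """Reward for training a pointwise judge from pairwise labels (sketch).

    Each response is scored in a separate, single-response prompt, so no
    ordering of candidates exists and positional bias cannot arise. The
    pairwise preference is recovered by rewarding score assignments that
    rank the chosen response above the rejected one.
    """
    score_chosen = extract_score(completion_chosen)
    score_rejected = extract_score(completion_rejected)
    if score_chosen is None or score_rejected is None:
        return 0.0
    return 1.0 if score_chosen > score_rejected else 0.0
```

Since the judge never sees the two responses side by side, there is no ordering to be biased by; the pairwise label is consumed only by the reward, and the trained model can also be used as a standalone pointwise scorer, enabling the multitask pointwise-and-pairwise usage the contribution describes.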

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution: J1 reinforcement learning framework for training thinking LLM judges

Contribution: Unified verifiable training format for judgment tasks

Contribution: Method for training position-consistent pointwise judges using pairwise supervision