J1: Incentivizing Thinking in LLM-as-a-Judge via Reinforcement Learning

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: LLM-as-a-Judge, Reasoning, Reinforcement Learning
Abstract:

The progress of AI is bottlenecked by the quality of evaluation, making powerful LLM-as-a-Judge models a core solution. The efficacy of these judges depends on their chain-of-thought reasoning, creating a critical need for methods that can effectively optimize this reasoning process. In this work, we introduce J1, a reinforcement learning framework for teaching LLM judges to think before making decisions. Our core contribution lies in converting all judgment tasks, for both non-verifiable and verifiable prompts, into a unified format with verifiable rewards, enabling direct optimization of evaluation quality while mitigating positional bias. We then use RL to train thinking-judges at scales of 8B, 32B, and 70B and show that they obtain state-of-the-art performance across multiple benchmarks. In particular, J1-Qwen-32B, our multitask pointwise-and-pairwise judge, also outperforms o1-mini, o3, and the much larger 671B DeepSeek-R1 on some benchmarks, while training only on synthetic data. Through comprehensive ablations of pairwise, pointwise, and multitask J1 variants, we demonstrate the effectiveness of our approach across seed prompts, reward strategies, and training recipes. Qualitative analysis reveals that J1 develops systematic evaluation strategies, including dynamic criteria generation, reference answer creation, iterative self-correction of initial assessments, and feedback generation for low-quality responses.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND ITS JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces J1, a reinforcement learning framework for training LLM judges to generate reasoning traces before making evaluation decisions. It resides in the Reasoning-Enhanced Judge Training leaf, which contains four papers total—a relatively sparse cluster within the broader taxonomy. This positioning suggests the work addresses a focused research direction: explicitly teaching judges to think through chain-of-thought reasoning rather than directly optimizing judgment outputs. The sibling papers in this leaf (Think J, Plan Reason Evaluation, and High Entropy Tokens) similarly explore reasoning scaffolds for evaluation, indicating a small but active community investigating how to incentivize deliberative judgment processes.

The taxonomy reveals that Reasoning-Enhanced Judge Training sits within the larger LLM Judge Training and Evaluation branch, which also includes Direct Judgment Optimization (preference learning without reasoning scaffolds) and Meta-Evaluation frameworks. Neighboring branches such as General Reasoning Enhancement via RL contain foundational methods like DeepSeek R1 and Learning Reason Search, which improve reasoning across diverse tasks but lack the evaluation-specific focus. The scope notes clarify that J1's emphasis on converting judgment tasks into verifiable formats distinguishes it from general-purpose reasoning methods, while its RL-driven approach separates it from direct preference optimization techniques that bypass explicit reasoning traces.

Across the three contributions analyzed, the literature search examined 26 candidate papers in total: 10 for the core J1 framework (zero refutable overlaps), 10 for the unified verifiable training format (no clear refutations), and 6 for the position-consistent pointwise training method (again, no refutations). These statistics suggest that within the limited search scope—top-K semantic matches plus citation expansion—no prior work was found that directly anticipates the specific combination of RL-driven reasoning incentives, unified verifiable rewards, and positional bias mitigation that J1 proposes. However, the small candidate pool means the analysis captures nearby work rather than exhaustive coverage.

Based on the limited search scope of 26 candidates, the work appears to occupy a distinct position within the sparse Reasoning-Enhanced Judge Training cluster. The absence of refutable overlaps across all three contributions suggests novelty relative to the examined literature, though the small candidate pool and focused taxonomy leaf indicate this assessment reflects local rather than field-wide context. The taxonomy structure shows that while general RL reasoning methods are well-populated, the specific intersection of judge training and reasoning scaffolds remains less explored.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 26
Refutable Papers: 0

Research Landscape Overview

Core task: Training LLM judges to reason before making evaluation decisions using reinforcement learning.

The field structure reflects a broad ecosystem organized around several complementary themes. At the top level, one branch focuses explicitly on LLM Judge Training and Evaluation, encompassing methods that teach models to produce explicit reasoning traces before rendering judgments. A second major branch, General Reasoning Enhancement via RL, addresses foundational techniques for improving step-by-step inference across diverse problem types—works like DeepSeek R1[3] and Learning Reason Search[1] exemplify efforts to scale reasoning through search and policy optimization. Domain-Specific RL Reasoning Applications and Cross-Domain and Multi-Task RL Reasoning branches capture specialized adaptations (e.g., mathematics, vision-language tasks, software engineering), while Auxiliary Components and Reward Modeling groups studies on verifiable rewards and meta-rewarding strategies. Additional branches cover efficiency optimizations, interactive and multi-agent settings, alternative training paradigms, and surveys that synthesize emerging trends.

Within this landscape, a particularly active line of work explores how to incentivize models to generate intermediate reasoning steps that improve final evaluation quality. Incentivizing Thinking Judge[0] sits squarely in the Reasoning-Enhanced Judge Training cluster, emphasizing RL-driven incentives for judges to think before deciding. This approach contrasts with nearby efforts such as Think J[36], which also targets judge reasoning but may differ in reward formulation or training dynamics, and Plan Reason Evaluation[6], which integrates planning modules into the evaluation pipeline. Meanwhile, works like High Entropy Tokens[5] and Test Time Scaling[40] investigate complementary angles—token-level uncertainty and inference-time compute trade-offs—that inform how reasoning traces are generated and validated.

The central tension across these studies revolves around balancing the cost of extended reasoning against gains in judgment accuracy, and determining which RL signals best align model introspection with human-like evaluative rigor.

Claimed Contributions

J1 reinforcement learning framework for training thinking LLM judges

The authors propose J1, a framework that uses reinforcement learning to train language models to perform chain-of-thought reasoning when evaluating responses. This enables judges to develop systematic evaluation strategies including criteria generation, reference answer creation, and iterative self-correction.

10 retrieved papers
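The verifiable-reward idea behind this contribution can be sketched in a few lines. The snippet below is a minimal illustration, not the paper's implementation: the `<think>` tag, the "Verdict: A/B" output format, and the function name are all assumptions made here for concreteness.

```python
import re

def pairwise_verdict_reward(completion: str, gold_label: str) -> float:
    """Verifiable reward for a pairwise thinking-judge (illustrative sketch).

    The judge is prompted to reason inside <think>...</think> tags and then
    emit a final verdict line such as "Verdict: A" or "Verdict: B". Only the
    final verdict is checked against the gold preference label, so the
    chain-of-thought itself is optimized indirectly through RL.
    """
    # Require a non-empty reasoning trace before the verdict (format check).
    if not re.search(r"<think>.+?</think>", completion, flags=re.DOTALL):
        return 0.0
    match = re.search(r"Verdict:\s*([AB])", completion)
    if match is None:
        return 0.0
    return 1.0 if match.group(1) == gold_label else 0.0
```

In an RL loop, this scalar would be the only training signal: the reasoning trace is never scored directly, which leaves the judge free to develop its own evaluation strategies, such as criteria generation or self-correction.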

Unified verifiable training format for judgment tasks

The authors develop a unified training recipe that transforms both verifiable tasks (like math problems) and non-verifiable subjective prompts into a format that can be optimized with reinforcement learning from verifiable rewards, enabling training of a single generalist judge across diverse domains using only synthetic data.

10 retrieved papers
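One way to picture the unified format is as a single constructor that maps both data sources onto the same labeled pairwise instance. The sketch below is an illustration under stated assumptions; the field names and the 50/50 order randomization are ours, not the paper's.

```python
import random
from dataclasses import dataclass

@dataclass
class JudgeInstance:
    prompt: str
    response_a: str
    response_b: str
    gold_label: str  # "A" or "B": which response is better

def make_instance(prompt: str, better: str, worse: str, rng=random) -> JudgeInstance:
    """Unify verifiable and non-verifiable data into one judgment format.

    For verifiable tasks (e.g. math), `better` is a correct solution and
    `worse` an incorrect one; for non-verifiable prompts, they are the
    chosen/rejected responses of a (possibly synthetic) preference pair.
    Either way the instance carries a verifiable gold label, so the same
    RL-from-verifiable-rewards recipe applies to both. Randomizing the
    order in which the two responses appear guards against positional
    shortcuts.
    """
    if rng.random() < 0.5:
        return JudgeInstance(prompt, better, worse, gold_label="A")
    return JudgeInstance(prompt, worse, better, gold_label="B")
```

Because every instance carries a gold label regardless of whether the underlying task was objectively verifiable, a single reward check applies uniformly across domains during RL training.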

Method for training position-consistent pointwise judges using pairwise supervision

The authors introduce a technique to train pointwise evaluation models that are inherently consistent across different positions by leveraging only pairwise supervision data. This approach addresses positional bias while enabling the development of multitask models capable of both pointwise and pairwise evaluations.

6 retrieved papers
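The position-consistency idea can be illustrated with a reward that scores each response in isolation and only compares the resulting scalars. The "Score: X" output format and both function names below are hypothetical, chosen here for illustration.

```python
import re
from typing import Optional

def extract_score(completion: str) -> Optional[float]:
    """Parse a final 'Score: X' value from a pointwise judge's output."""
    match = re.search(r"Score:\s*([0-9]+(?:\.[0-9]+)?)", completion)
    return float(match.group(1)) if match else None

def pointwise_pair_reward(completion_chosen: str, completion_rejected: str) -> float:
    """Reward for training a pointwise judge from pairwise labels (sketch).

    Each response is scored in a separate, single-response prompt, so no
    ordering of candidates exists and positional bias cannot arise. The
    pairwise preference is recovered by rewarding score assignments that
    rank the chosen response above the rejected one.
    """
    score_chosen = extract_score(completion_chosen)
    score_rejected = extract_score(completion_rejected)
    if score_chosen is None or score_rejected is None:
        return 0.0
    return 1.0 if score_chosen > score_rejected else 0.0
```

Since the judge never sees the two responses side by side, there is no ordering to be biased by; the pairwise label is consumed only by the reward, and the trained model can also be used as a standalone pointwise scorer, enabling the multitask pointwise-and-pairwise usage the contribution describes.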

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution: J1 reinforcement learning framework for training thinking LLM judges

Contribution: Unified verifiable training format for judgment tasks

Contribution: Method for training position-consistent pointwise judges using pairwise supervision