wd1: Weighted Policy Optimization for Reasoning in Diffusion Language Models

ICLR 2026 Conference Submission, Anonymous Authors
Keywords: Diffusion Language Models, Reinforcement Learning, Reasoning
Abstract:

Improving the reasoning capabilities of diffusion-based large language models (dLLMs) through reinforcement learning (RL) remains an open problem. The intractability of the dLLM likelihood function necessitates approximating the current, old, and reference policy likelihoods at each policy optimization step. This reliance introduces additional computational overhead and can lead to large variance and estimation error in the RL objective -- particularly in computing the policy ratio for importance sampling. To mitigate these issues, we introduce wd1, a novel ratio-free policy optimization approach that reformulates the objective as a weighted log-likelihood, requiring only a single approximation of the current parametrized policy likelihood. We formally show that the proposed method can be interpreted as energy-guided discrete diffusion training combined with negative-sample unlearning, confirming its theoretical soundness. In experiments on the LLaDA-8B model, wd1 outperforms diffusion-based GRPO (d1) at lower computational cost, achieving up to a +59% improvement in accuracy. Furthermore, we extend wd1 to denoising-stepwise weighted policy optimization (wd1++), achieving state-of-the-art math performance of 44.2% on MATH500 and 84.5% on GSM8K with only 20 RL training steps.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 11
Refutable Papers: 0
Research Landscape Overview

Core task: Reinforcement learning for diffusion-based large language models. The field has rapidly expanded into several major branches that reflect different emphases in adapting RL to discrete diffusion architectures.

Policy Optimization Algorithms for Diffusion Language Models focuses on developing tractable gradient estimators and variance reduction techniques -- such as likelihood approximation methods -- that enable stable training of diffusion-based text generators under reward signals. Exploration and Inference Strategies examines how to guide the iterative denoising process at test time, while Multimodal and Unified Diffusion Architectures extends these ideas beyond text to vision and cross-modal settings. Meanwhile, Diffusion Models for Reinforcement Learning Tasks and Application-Specific Diffusion and RL Integration explore using diffusion as a planning or policy representation in traditional RL domains, and Reinforcement Learning for General Diffusion Model Alignment addresses broader safety and preference alignment questions. Surveys, Benchmarks, and Comparative Studies provide overarching perspectives, and Open-Source Diffusion Language Model Implementations offer practical tooling for the community.

Within Policy Optimization, a dense cluster of works tackles the challenge of high-variance gradients inherent in discrete diffusion. LLaDA Variance Reduced[8] and Sandwiched Policy Gradient[16] exemplify efforts to bound or reduce variance through control variates and tighter likelihood bounds, while Boundary-Guided Policy[14] and d2 Improved Techniques[37] propose alternative parameterizations or training schedules. Weighted Policy Optimization[0] sits squarely in this Likelihood Approximation and Variance Reduction subgroup, sharing the goal of making policy gradients more stable and sample-efficient.
Compared to LLaDA Variance Reduced[8], which emphasizes amortized baselines, Weighted Policy Optimization[0] explores weighting schemes that directly modulate gradient contributions across diffusion steps. This contrasts with Sandwiched Policy Gradient[16], which instead sandwiches the policy between upper and lower likelihood bounds. Across these neighboring works, the central trade-off remains balancing computational overhead against variance reduction, with each method offering a distinct lens on tractable credit assignment in diffusion language models.

Claimed Contributions

wd1: Weighted Policy Optimization for Diffusion Language Models

The authors propose wd1, a reinforcement learning method for diffusion-based large language models that eliminates the need for policy ratio computation in importance sampling. The method reformulates the RL objective as a weighted log-likelihood where weights balance increasing probability of high-advantage completions and decreasing probability of low-advantage ones, requiring only one likelihood approximation instead of three.
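As a rough schematic of this claim (the notation below is ours, introduced for illustration, and need not match the paper's exact formulation), a ratio-free weighted log-likelihood objective has the form:

```latex
\mathcal{L}_{\text{wd1}}(\theta)
  \;=\; -\,\mathbb{E}_{o \sim \pi_{\text{old}}(\cdot \mid q)}
  \Big[\, w\big(\hat{A}(o)\big)\, \log \pi_\theta(o \mid q) \,\Big],
```

where $\hat{A}(o)$ is the advantage of completion $o$ and the weight $w(\cdot)$ is positive for high-advantage completions (raising their probability) and negative for low-advantage ones (lowering it). Only the single term $\log \pi_\theta$ must be approximated (e.g., via an ELBO-style bound on the diffusion likelihood), whereas a GRPO-style ratio objective additionally requires approximating $\pi_{\text{old}}$ and $\pi_{\text{ref}}$.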

1 retrieved paper
Theoretical interpretation as energy-guided diffusion with unlearning

The authors establish a theoretical connection showing that their weighted policy optimization objective is equivalent to training an energy-guided discrete diffusion model where the energy function is the negative advantage, combined with unlearning of low-advantage samples. This provides formal justification for the method's design.
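Under this reading (again schematic, with symbols we introduce for illustration), the target policy is a tilted version of the sampling policy with energy equal to the negative advantage:

```latex
\pi^{*}(o \mid q) \;\propto\; \pi_{\text{old}}(o \mid q)\,
  \exp\!\big(-E(o)/\beta\big),
\qquad E(o) \;=\; -\hat{A}(o),
```

so that fitting $\pi_\theta$ toward $\pi^{*}$ on positive-advantage samples corresponds to energy-guided discrete diffusion training, while the negative-advantage term acts as unlearning of low-reward samples.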

0 retrieved papers
wd1++: Denoising-stepwise weighted policy optimization

The authors extend their base method to wd1++, which leverages intermediate completions generated during the iterative denoising process rather than only using final outputs. This extension achieves state-of-the-art performance on mathematical reasoning benchmarks with fewer training steps and rollouts.
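One plausible sketch of the stepwise extension (our illustration only, not the paper's stated objective) applies the weighted log-likelihood not just to the final output but to intermediate partially denoised completions $o_t$ collected along the denoising trajectory:

```latex
\mathcal{L}_{\text{wd1++}}(\theta)
  \;=\; -\,\mathbb{E}\Big[\, \sum_{t=1}^{T}
  w\big(\hat{A}(o_t)\big)\, \log \pi_\theta(o_t \mid q) \,\Big],
```

so a single rollout contributes multiple training signals, which is consistent with the reported efficiency gains from fewer training steps and rollouts.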

10 retrieved papers

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

wd1: Weighted Policy Optimization for Diffusion Language Models

The authors propose wd1, a reinforcement learning method for diffusion-based large language models that eliminates the need for policy ratio computation in importance sampling. The method reformulates the RL objective as a weighted log-likelihood where weights balance increasing probability of high-advantage completions and decreasing probability of low-advantage ones, requiring only one likelihood approximation instead of three.

Contribution

Theoretical interpretation as energy-guided diffusion with unlearning

The authors establish a theoretical connection showing that their weighted policy optimization objective is equivalent to training an energy-guided discrete diffusion model where the energy function is the negative advantage, combined with unlearning of low-advantage samples. This provides formal justification for the method's design.

Contribution

wd1++: Denoising-stepwise weighted policy optimization

The authors extend their base method to wd1++, which leverages intermediate completions generated during the iterative denoising process rather than only using final outputs. This extension achieves state-of-the-art performance on mathematical reasoning benchmarks with fewer training steps and rollouts.