wd1: Weighted Policy Optimization for Reasoning in Diffusion Language Models

ICLR 2026 Conference Submission, Anonymous Authors
Keywords: Diffusion Language Models, Reinforcement Learning, Reasoning
Abstract:

Improving the reasoning capabilities of diffusion-based large language models (dLLMs) through reinforcement learning (RL) remains an open problem. The intractability of the dLLM likelihood function necessitates approximating the current, old, and reference policy likelihoods at each policy optimization step. This reliance introduces additional computational overhead and can lead to large variance and estimation error in the RL objective -- particularly in computing the policy ratio for importance sampling. To mitigate these issues, we introduce wd1, a novel ratio-free policy optimization approach that reformulates the objective as a weighted log-likelihood, requiring only a single approximation of the current parametrized policy likelihood. We formally show that the proposed method can be interpreted as energy-guided discrete diffusion training combined with negative-sample unlearning, confirming its theoretical soundness. In experiments on the LLaDA-8B model, wd1 outperforms diffusion-based GRPO (d1) at lower computational cost, achieving up to a +59% improvement in accuracy. Furthermore, we extend wd1 to denoising-stepwise weighted policy optimization (wd1++), achieving state-of-the-art math performance of 44.2% on MATH500 and 84.5% on GSM8K with only 20 RL training steps.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 11
Refutable Papers: 0
Research Landscape Overview

Core task: Reinforcement learning for diffusion-based large language models. The field has rapidly expanded into several major branches that reflect different emphases in adapting RL to discrete diffusion architectures.

Policy Optimization Algorithms for Diffusion Language Models focuses on developing tractable gradient estimators and variance reduction techniques -- such as likelihood approximation methods -- that enable stable training of diffusion-based text generators under reward signals. Exploration and Inference Strategies examines how to guide the iterative denoising process at test time, while Multimodal and Unified Diffusion Architectures extends these ideas beyond text to vision and cross-modal settings. Meanwhile, Diffusion Models for Reinforcement Learning Tasks and Application-Specific Diffusion and RL Integration explore using diffusion as a planning or policy representation in traditional RL domains, and Reinforcement Learning for General Diffusion Model Alignment addresses broader safety and preference alignment questions. Surveys, Benchmarks, and Comparative Studies provide overarching perspectives, and Open-Source Diffusion Language Model Implementations offer practical tooling for the community.

Within Policy Optimization, a dense cluster of works tackles the challenge of high-variance gradients inherent in discrete diffusion. LLaDA Variance Reduced[8] and Sandwiched Policy Gradient[16] exemplify efforts to bound or reduce variance through control variates and tighter likelihood bounds, while Boundary-Guided Policy[14] and d2 Improved Techniques[37] propose alternative parameterizations or training schedules. Weighted Policy Optimization[0] sits squarely in this Likelihood Approximation and Variance Reduction subgroup, sharing the goal of making policy gradients more stable and sample-efficient.
Compared to LLaDA Variance Reduced[8], which emphasizes amortized baselines, Weighted Policy Optimization[0] explores weighting schemes that directly modulate gradient contributions across diffusion steps. This contrasts with Sandwiched Policy Gradient[16], which instead sandwiches the policy between upper and lower likelihood bounds. Across these neighboring works, the central trade-off remains balancing computational overhead against variance reduction, with each method offering a distinct lens on tractable credit assignment in diffusion language models.

Claimed Contributions

wd1: Weighted Policy Optimization for Diffusion Language Models

The authors propose wd1, a reinforcement learning method for diffusion-based large language models that eliminates the need for policy ratio computation in importance sampling. The method reformulates the RL objective as a weighted log-likelihood where weights balance increasing probability of high-advantage completions and decreasing probability of low-advantage ones, requiring only one likelihood approximation instead of three.
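As a rough schematic of this claim (the notation below is ours, introduced for illustration, and need not match the paper's exact formulation), a ratio-free weighted log-likelihood objective has the form:

```latex
\mathcal{L}_{\text{wd1}}(\theta)
  \;=\; -\,\mathbb{E}_{o \sim \pi_{\text{old}}(\cdot \mid q)}
  \Big[\, w\big(\hat{A}(o)\big)\, \log \pi_\theta(o \mid q) \,\Big],
```

where $\hat{A}(o)$ is the advantage of completion $o$ and the weight $w(\cdot)$ is positive for high-advantage completions (raising their probability) and negative for low-advantage ones (lowering it). Only the single term $\log \pi_\theta$ must be approximated (e.g., via an ELBO-style bound on the diffusion likelihood), whereas a GRPO-style ratio objective additionally requires approximating $\pi_{\text{old}}$ and $\pi_{\text{ref}}$.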

1 retrieved paper
Theoretical interpretation as energy-guided diffusion with unlearning

The authors establish a theoretical connection showing that their weighted policy optimization objective is equivalent to training an energy-guided discrete diffusion model where the energy function is the negative advantage, combined with unlearning of low-advantage samples. This provides formal justification for the method's design.
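Under this reading (again schematic, with symbols we introduce for illustration), the target policy is a tilted version of the sampling policy with energy equal to the negative advantage:

```latex
\pi^{*}(o \mid q) \;\propto\; \pi_{\text{old}}(o \mid q)\,
  \exp\!\big(-E(o)/\beta\big),
\qquad E(o) \;=\; -\hat{A}(o),
```

so that fitting $\pi_\theta$ toward $\pi^{*}$ on positive-advantage samples corresponds to energy-guided discrete diffusion training, while the negative-advantage term acts as unlearning of low-reward samples.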

0 retrieved papers
wd1++: Denoising-stepwise weighted policy optimization

The authors extend their base method to wd1++, which leverages intermediate completions generated during the iterative denoising process rather than only using final outputs. This extension achieves state-of-the-art performance on mathematical reasoning benchmarks with fewer training steps and rollouts.
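One plausible sketch of the stepwise extension (our illustration only, not the paper's stated objective) applies the weighted log-likelihood not just to the final output but to intermediate partially denoised completions $o_t$ collected along the denoising trajectory:

```latex
\mathcal{L}_{\text{wd1++}}(\theta)
  \;=\; -\,\mathbb{E}\Big[\, \sum_{t=1}^{T}
  w\big(\hat{A}(o_t)\big)\, \log \pi_\theta(o_t \mid q) \,\Big],
```

so a single rollout contributes multiple training signals, which is consistent with the reported efficiency gains from fewer training steps and rollouts.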

10 retrieved papers

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

wd1: Weighted Policy Optimization for Diffusion Language Models

The authors propose wd1, a reinforcement learning method for diffusion-based large language models that eliminates the need for policy ratio computation in importance sampling. The method reformulates the RL objective as a weighted log-likelihood where weights balance increasing probability of high-advantage completions and decreasing probability of low-advantage ones, requiring only one likelihood approximation instead of three.

Contribution

Theoretical interpretation as energy-guided diffusion with unlearning

The authors establish a theoretical connection showing that their weighted policy optimization objective is equivalent to training an energy-guided discrete diffusion model where the energy function is the negative advantage, combined with unlearning of low-advantage samples. This provides formal justification for the method's design.

Contribution

wd1++: Denoising-stepwise weighted policy optimization

The authors extend their base method to wd1++, which leverages intermediate completions generated during the iterative denoising process rather than only using final outputs. This extension achieves state-of-the-art performance on mathematical reasoning benchmarks with fewer training steps and rollouts.