wd1: Weighted Policy Optimization for Reasoning in Diffusion Language Models
Overview
Taxonomy
Research Landscape Overview
Claimed Contributions
The authors propose wd1, a reinforcement learning method for diffusion-based large language models that removes the policy-ratio computation used in importance sampling. The method reformulates the RL objective as a weighted log-likelihood whose weights trade off raising the likelihood of high-advantage completions against lowering the likelihood of low-advantage ones, so each update requires only one likelihood approximation instead of three.
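As a concrete sketch of the idea: each completion's single log-likelihood estimate is weighted by its advantage, pushing high-advantage completions up and low-advantage ones down, with no policy ratio anywhere. The function name, the exponential weighting, and the `beta` temperature below are illustrative assumptions, not the paper's exact formulation:

```python
import math

def wd1_weighted_loss(log_likelihoods, advantages, beta=1.0):
    """Illustrative sketch of a weighted log-likelihood RL loss.

    A ratio-based objective (e.g. PPO/GRPO-style) needs likelihood
    estimates under the current, old, and reference policies; here a
    single log-likelihood estimate per completion is weighted instead.
    The exponential weighting and `beta` are assumptions made for
    illustration only.
    """
    total = 0.0
    for logp, adv in zip(log_likelihoods, advantages):
        weight = math.exp(abs(adv) / beta)
        if adv >= 0:
            total -= weight * logp  # minimizing the loss raises log-likelihood
        else:
            total += weight * logp  # minimizing the loss lowers it ("unlearning")
    return total / len(log_likelihoods)
```

Note that only one set of log-likelihoods enters the loss, which is where the "one approximation instead of three" saving comes from.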
The authors show that their weighted policy optimization objective is equivalent to training an energy-guided discrete diffusion model whose energy function is the negative advantage, combined with unlearning of low-advantage samples. This equivalence provides a formal justification for the method's design.
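In hedged outline, this kind of equivalence typically rests on the standard KL-regularized RL result that the optimal policy is an energy-tilted reference distribution; the notation below (reference policy $\pi_{\mathrm{ref}}$, advantage $A$, temperature $\beta$) is assumed for illustration rather than taken from the paper:

```latex
\pi^{*}(y \mid x) \;\propto\; \pi_{\mathrm{ref}}(y \mid x)\,
  \exp\!\bigl(-E(x, y)/\beta\bigr),
\qquad E(x, y) \;=\; -A(x, y).
```

Reading the negative advantage as the energy $E$ makes sampling from $\pi^{*}$ an instance of energy-guided discrete diffusion.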
The authors extend their base method to wd1++, which leverages intermediate completions generated during the iterative denoising process rather than only using final outputs. This extension achieves state-of-the-art performance on mathematical reasoning benchmarks with fewer training steps and rollouts.
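To make the wd1++ idea concrete: a masked diffusion model produces a trajectory of partially denoised sequences on its way to the final output, and each snapshot can serve as an extra training sample. The toy sampler below only illustrates where those intermediate completions come from; the function name and the random unmasking are stand-ins for a real model's predictions:

```python
import random

def denoise_with_intermediates(num_steps, seq_len, vocab, mask="[MASK]", seed=0):
    """Toy masked-diffusion sampler that records intermediate completions.

    Starts from a fully masked sequence and unmasks a few positions per
    step, snapshotting the sequence after each step. The random choices
    below are illustrative stand-ins for model predictions.
    """
    rng = random.Random(seed)
    seq = [mask] * seq_len
    masked = list(range(seq_len))
    per_step = max(1, seq_len // num_steps)
    intermediates = []
    for _ in range(num_steps):
        if not masked:
            break
        for _ in range(min(per_step, len(masked))):
            pos = masked.pop(rng.randrange(len(masked)))
            seq[pos] = rng.choice(vocab)  # stand-in for a model prediction
        intermediates.append(list(seq))   # snapshot after this denoising step
    return intermediates
```

Training on every snapshot rather than only `intermediates[-1]` is what lets wd1++ extract more learning signal per rollout.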
Core Task Comparisons
Comparisons with papers in the same taxonomy category
[8] LLaDA 1.5: Variance-Reduced Preference Optimization for Large Language Diffusion Models
[14] Boundary-Guided Policy Optimization for Memory-efficient RL of Diffusion Large Language Models
[16] SPG: Sandwiched Policy Gradient for Masked Diffusion Language Models
[37] d2: Improved Techniques for Training Reasoning Diffusion Language Models
Contribution Analysis
Detailed comparisons for each claimed contribution
wd1: Weighted Policy Optimization for Diffusion Language Models
wd1 replaces importance-sampling policy ratios with a weighted log-likelihood objective that balances reinforcing high-advantage completions against suppressing low-advantage ones, requiring a single likelihood approximation per update.
[51] Advantage Weighted Matching: Aligning RL with Pretraining in Diffusion Models
Theoretical interpretation as energy-guided diffusion with unlearning
The weighted objective is shown to be equivalent to training an energy-guided discrete diffusion model whose energy is the negative advantage, combined with unlearning of low-advantage samples, which formally justifies the design.
wd1++: Denoising-stepwise weighted policy optimization
wd1++ additionally trains on the intermediate completions produced along the iterative denoising trajectory rather than the final outputs alone, reaching state-of-the-art performance on mathematical reasoning benchmarks with fewer training steps and rollouts.