Abstract:

Diffusion models excel at generating high-likelihood samples but often require alignment with downstream objectives. Existing fine-tuning methods for diffusion models suffer significantly from reward over-optimization, which yields high-reward but unnatural samples and degraded diversity. To mitigate over-optimization, we propose Soft Q-based Diffusion Finetuning (SQDF), a novel KL-regularized RL method for diffusion alignment that applies a reparameterized policy gradient through a training-free, differentiable estimate of the soft Q-function. SQDF is further enhanced with three innovations: a discount factor for proper credit assignment in the denoising process, the integration of consistency models to refine Q-function estimates, and the use of an off-policy replay buffer to improve mode coverage and manage the reward-diversity trade-off. Our experiments demonstrate that SQDF achieves superior target rewards while preserving diversity in text-to-image alignment. Furthermore, in online black-box optimization, SQDF attains high sample efficiency while maintaining naturalness and diversity. Our code is available at https://anonymous.4open.science/r/SQDF-B66C

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholar search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While this system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes SQDF, a KL-regularized RL method using soft Q-function estimation for diffusion model alignment. It resides in the 'Soft Q-Function and Value-Based Methods' leaf, which contains only two papers total (including this work). This leaf sits within the broader 'Policy Gradient and Direct RL Approaches' branch, which encompasses foundational policy gradient methods, advanced credit assignment techniques, and continuous-time formulations. The sparse population of this specific leaf suggests that value-based approaches remain relatively underexplored compared to policy gradient methods in diffusion alignment.

The taxonomy reveals that most fine-tuning work concentrates in adjacent leaves: 'Foundational Policy Gradient Methods' (3 papers), 'Advanced Credit Assignment and Trajectory-Level Optimization' (3 papers), and 'Continuous-Time and Stochastic Control Formulations' (2 papers). The sibling paper in this leaf, Advantage Weighted Matching, also leverages advantage estimates but differs in optimization formulation. Neighboring branches include preference-based alignment methods and gradient-based backpropagation approaches, which avoid explicit RL formulations. The taxonomy's scope and exclude notes clarify that this leaf focuses specifically on value function estimation, distinguishing it from pure policy gradient techniques.

Among the nine candidates examined in total, the core SQDF method (Contribution A) had one refutable candidate out of seven examined, suggesting some prior overlap in soft Q-function approaches for diffusion alignment. The three stabilization techniques (Contribution B) were compared against two candidates with no refutations, indicating these enhancements may be more novel. The training-free Q-function approximation (Contribution C) had zero candidates examined, leaving its novelty unassessed within this limited search. These statistics reflect a focused semantic search rather than exhaustive coverage, so contributions without clear refutations may still have undiscovered prior work.

Given the limited search scope (nine candidates total), this analysis captures immediate semantic neighbors but cannot rule out relevant work outside the top-K matches. The sparse leaf population and single refutable pair suggest SQDF occupies a relatively less-crowded methodological niche, though the field's rapid growth means recent preprints or concurrent work may not appear in this snapshot. The enhancement techniques appear more distinctive within the examined literature than the core soft Q-function framework.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 9
Refutable Papers: 1

Research Landscape Overview

Core task: alignment of diffusion models with downstream objectives via reinforcement learning. The field organizes around several major branches that reflect different strategic emphases. Fine-tuning methods and algorithmic frameworks encompass policy gradient and value-based techniques that directly optimize diffusion model parameters, with works like Training Diffusion RL[1] and DPOK[5] exemplifying direct RL approaches. Inference-time alignment and guidance methods, such as Inference Time Alignment[6] and Inference Time Control[12], adjust generation without retraining by steering the sampling process. Domain-specific applications demonstrate how these alignment strategies translate to robotics, medical imaging, and other specialized settings, while a growing body of survey and tutorial works, including RL Finetuning Tutorial[2] and Preference Alignment Survey[17], synthesizes emerging best practices. Related alignment and optimization methods explore connections to preference learning and divergence minimization, and a separate branch addresses diffusion models for reinforcement learning tasks, where generative models serve as policy representations or world models.

Within fine-tuning frameworks, a particularly active line of research contrasts policy gradient methods with value-based and soft Q-function approaches. Policy gradient techniques often face challenges with high variance and credit assignment across diffusion timesteps, prompting explorations of advantage weighting and variance reduction. Value-based methods, including soft Q-function formulations, offer an alternative by learning action-value estimates to guide alignment. Reparameterized Policy Gradient[0] sits squarely in this value-based cluster, emphasizing soft Q-function techniques to stabilize gradient estimation. It shares methodological kinship with Advantage Weighted Matching[29], which also leverages advantage estimates to refine diffusion policies, yet differs in how it reparameterizes the optimization objective. These approaches collectively address the tension between sample efficiency and stability, a central open question as practitioners scale alignment to large models and diverse reward signals.

Claimed Contributions

Soft Q-based Diffusion Finetuning (SQDF) method

The authors introduce SQDF, a KL-regularized reinforcement learning framework that fine-tunes diffusion models using a reparameterized policy gradient guided by a training-free soft Q-function approximation. This approach avoids unstable value function training and enables direct use of reward gradients for low-variance policy updates while mitigating reward over-optimization.

7 retrieved papers
Can Refute
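The core idea described above can be illustrated with a minimal one-dimensional sketch of a KL-regularized, reparameterized policy gradient. This is not the authors' implementation: the Gaussian policy, the quadratic `reward`, and the constants `mu0`, `sigma`, and `beta` are all illustrative assumptions. The point is only that the reward gradient flows through the reparameterized sample while the KL term anchors the fine-tuned policy to the pretrained one.

```python
import random

# Toy 1-D KL-regularized, reparameterized policy gradient (illustrative sketch).
# Policy: N(mu, sigma^2); pretrained reference: N(mu0, sigma^2).
# Loss: L(mu) = -E_eps[ r(mu + sigma*eps) ] + beta * KL(N(mu,.) || N(mu0,.)),
# where KL = (mu - mu0)^2 / (2 sigma^2) for equal variances.

def reward(x, target=2.0):
    return -(x - target) ** 2           # differentiable stand-in reward

def reward_grad(x, target=2.0):
    return -2.0 * (x - target)

def fit(mu0=0.0, sigma=0.5, beta=4.0, lr=0.05, steps=500, n_samples=32, seed=0):
    rng = random.Random(seed)
    mu = mu0
    for _ in range(steps):
        # Reparameterization trick: x = mu + sigma*eps, so dL/dmu = -E[r'(x)].
        grad_r = sum(reward_grad(mu + sigma * rng.gauss(0.0, 1.0))
                     for _ in range(n_samples)) / n_samples
        grad_kl = (mu - mu0) / sigma ** 2
        mu -= lr * (-grad_r + beta * grad_kl)
    return mu
```

With a strong KL weight the fitted mean stays near the pretrained mean `mu0`; weakening `beta` lets it drift toward the reward optimum, mirroring the reward-versus-naturalness trade-off this contribution targets.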
Three stabilization and enhancement techniques for SQDF

The authors propose three complementary techniques to improve SQDF: (1) a discount factor gamma to downweight early denoising steps for better credit assignment, (2) consistency models to provide more accurate soft Q-function estimates across all timesteps, and (3) an off-policy replay buffer to improve mode coverage and manage the reward-diversity trade-off.

2 retrieved papers
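Two of the three techniques lend themselves to a short sketch: a gamma-discounted weighting over denoising steps for credit assignment, and a capped off-policy replay buffer. The values of `T`, `gamma`, and the buffer capacity below are illustrative assumptions, not the paper's settings.

```python
import random
from collections import deque

T = 50          # number of denoising steps (illustrative)
gamma = 0.95    # discount factor; downweights early, noisy denoising steps

# Weight applied to the update at denoising step t, where t = T-1 is the final,
# sample-producing step: early steps receive gamma^(many), i.e. less credit.
step_weights = [gamma ** (T - 1 - t) for t in range(T)]

class ReplayBuffer:
    """Capped FIFO buffer for off-policy reuse of denoising trajectories."""
    def __init__(self, capacity=256, seed=0):
        self.buf = deque(maxlen=capacity)   # oldest trajectories are evicted
        self.rng = random.Random(seed)

    def add(self, trajectory):
        self.buf.append(trajectory)

    def sample(self, batch_size):
        # Mix of fresh and stored trajectories improves mode coverage.
        return self.rng.sample(list(self.buf), min(batch_size, len(self.buf)))
```

The discounting concentrates the learning signal on the late denoising steps that most directly determine the final sample, while the buffer lets updates reuse older trajectories instead of relying solely on fresh on-policy rollouts.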
Training-free soft Q-function approximation via posterior mean

The authors develop a training-free approximation of the soft Q-function using single-step posterior mean estimation derived from Tweedie's formula. This approximation is differentiable under parameterized reward models, enabling direct gradient-based policy updates without requiring explicit value function training.

0 retrieved papers
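A minimal sketch of this posterior-mean construction, assuming a DDPM-style forward process and an epsilon-prediction parameterization: Tweedie's formula gives the one-step posterior mean of the clean sample, and the soft Q-estimate evaluates the reward there. The oracle noise predictor below stands in for a real network, so this is an illustration of the formula, not the authors' code.

```python
import math
import random

def posterior_mean_x0(x_t, eps_hat, alpha_bar_t):
    # For the forward process x_t = sqrt(ab_t)*x0 + sqrt(1 - ab_t)*eps,
    # Tweedie's formula gives E[x0 | x_t] = (x_t - sqrt(1 - ab_t)*eps_hat) / sqrt(ab_t).
    return (x_t - math.sqrt(1.0 - alpha_bar_t) * eps_hat) / math.sqrt(alpha_bar_t)

def soft_q_estimate(x_t, eps_hat, alpha_bar_t, reward):
    # Training-free proxy: reward evaluated at the one-step posterior mean.
    # Differentiable whenever `reward` and the noise predictor are.
    return reward(posterior_mean_x0(x_t, eps_hat, alpha_bar_t))

# Demo with an oracle noise predictor: if eps_hat equals the true noise,
# the posterior mean recovers x0 exactly.
rng = random.Random(0)
x0 = 1.3
alpha_bar = 0.4
eps = rng.gauss(0.0, 1.0)
x_t = math.sqrt(alpha_bar) * x0 + math.sqrt(1.0 - alpha_bar) * eps
x0_hat = posterior_mean_x0(x_t, eps, alpha_bar)
q_hat = soft_q_estimate(x_t, eps, alpha_bar, reward=lambda x: -(x - 1.0) ** 2)
```

Because the estimate is a deterministic, differentiable function of the noise prediction, reward gradients can be backpropagated through it directly, which is what removes the need for a separately trained value function.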

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Soft Q-based Diffusion Finetuning (SQDF) method

Contribution

Three stabilization and enhancement techniques for SQDF

Contribution

Training-free soft Q-function approximation via posterior mean