Half-order Fine-Tuning for Diffusion Model: A Recursive Likelihood Ratio Optimizer

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: perturbation-based gradient estimation, diffusion model, post-training
Abstract:

The probabilistic diffusion model (DM), which generates content through inference over a recursive chain structure, has emerged as a powerful framework for visual generation. After pre-training on enormous datasets, the model must be properly aligned to meet the requirements of downstream applications, so efficiently aligning the foundation DM is a crucial task. Contemporary methods are based either on Reinforcement Learning (RL) or on truncated Backpropagation (BP). However, RL and truncated BP suffer from low sample efficiency and biased gradient estimation, respectively, resulting in limited improvement or, worse, complete training failure. To overcome these challenges, we propose the Recursive Likelihood Ratio (RLR) optimizer, a Half-Order (HO) fine-tuning paradigm for DMs. The HO gradient estimator enables rearrangement of the computation graph within the recursive diffusion chain, making RLR's gradient estimator unbiased with lower variance than that of other methods. We theoretically investigate the bias, variance, and convergence of our method, and conduct extensive experiments on image and video generation to validate the superiority of RLR. Furthermore, we propose a novel prompting technique that naturally complements RLR, achieving a synergistic effect.

Disclaimer
This report is AI-generated using Large Language Models and WisPaper (a scholarly search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs), and the system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes a Recursive Likelihood Ratio (RLR) optimizer for aligning diffusion models, positioning itself within the Reinforcement Learning-Based Alignment leaf of the taxonomy. This leaf contains five papers total, including the original work, indicating a moderately active research direction. The core contribution addresses gradient estimation challenges in RL-based fine-tuning by introducing a 'Half-Order' paradigm that claims unbiased gradient estimation with lower variance than existing truncated backpropagation or standard RL approaches. The work sits at the intersection of preference-based alignment and computational efficiency concerns.

The taxonomy reveals that Reinforcement Learning-Based Alignment is one of three sibling approaches under Preference-Based Alignment Methods, alongside Direct Preference Optimization (seven papers) and Trajectory and Sampling Optimization (two papers). Direct Preference Optimization represents a more crowded alternative direction that avoids explicit reward models, while the original paper's RL-based approach maintains reward signals but seeks to improve gradient estimation. Neighboring branches like Parameter-Efficient Adaptation (eight papers across two leaves) and Representation and Feature Alignment (seven papers) address orthogonal efficiency concerns through architectural modifications rather than training dynamics, suggesting the field explores multiple complementary strategies for alignment.

Among the thirty candidates examined across the three contributions, none were identified as clearly refuting the proposed methods. The RLR optimizer was compared against ten candidates with zero refutable overlaps, as were the systematic design space analysis and the Diffusive Chain-of-Thought prompt technique. The absence of refutation within this limited search scope suggests either genuine novelty in the specific gradient estimation approach or that the semantic search did not surface closely related variance reduction techniques in RL-based diffusion fine-tuning. The contribution-level statistics indicate consistent novelty signals across all three claimed innovations within the examined candidate set.

Based on the top-thirty semantic matches and taxonomy structure, the work appears to occupy a distinct position within RL-based alignment by focusing on gradient estimator properties rather than reward model design or sampling strategies. However, the limited search scope means potentially relevant work in variance reduction for sequential decision-making or alternative gradient estimation techniques in generative modeling may not have been captured. The taxonomy context suggests this is an active but not overcrowded research direction with clear differentiation from adjacent preference optimization paradigms.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 0

Research Landscape Overview

Core task: efficient fine-tuning of diffusion models for alignment. The field has organized itself around several complementary strategies for adapting large-scale diffusion models to better match human preferences, domain requirements, or specific generation tasks without prohibitive computational costs. Preference-Based Alignment Methods leverage human feedback and reinforcement learning signals to steer model outputs toward desired qualities, often employing reward models or direct preference optimization. Parameter-Efficient Adaptation Techniques focus on minimizing trainable parameters through approaches like low-rank adaptors (IP-Adapter[4]) or structured gating mechanisms, enabling rapid customization with limited resources. Representation and Feature Alignment (Representation Alignment[3]) targets internal model representations to improve semantic consistency, while Domain-Specific Fine-Tuning Applications and Temporal and Video Generation Adaptation (AnimateDiff[5]) address specialized modalities such as medical imaging or coherent video synthesis. Specialized Fine-Tuning Paradigms explore novel training regimes including self-play (Self-Play Fine-Tuning[1]) and fairness constraints (Fairness Fine-Tuning[2]), and Conditional Generation Enhancement refines how models respond to complex or multi-modal conditioning signals. Within this landscape, reinforcement learning-based alignment has emerged as a particularly active area, balancing sample efficiency with the need for stable gradient signals from reward functions. Works like Reward Backpropagation[8] and Latent-Space Surrogate Reward[41] demonstrate contrasting strategies: some backpropagate rewards directly through the diffusion process, while others construct surrogate objectives in latent space to reduce computational overhead. 
The original paper, Recursive Likelihood Ratio[0], situates itself in this reinforcement learning-driven branch, proposing a method that recursively refines alignment signals to improve sample quality under human preferences. Compared to Human-Feedback Efficient[27], which emphasizes minimizing annotation burden, or MIRA[47], which integrates multi-modal reward signals, Recursive Likelihood Ratio[0] focuses on iteratively sharpening the likelihood-based feedback loop. This positioning highlights ongoing tensions between annotation efficiency, computational cost, and the fidelity of alignment to nuanced human judgments.

Claimed Contributions

Recursive Likelihood Ratio (RLR) optimizer for diffusion model fine-tuning

The authors introduce a novel gradient estimation method called RLR that reorganizes the computation graph in diffusion models using a half-order approach. This method combines first-order, half-order, and zeroth-order gradient estimation strategies to achieve unbiased gradient estimation with lower variance compared to existing reinforcement learning and truncated backpropagation methods.

10 retrieved papers
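The likelihood-ratio idea underlying this contribution can be illustrated with a minimal, self-contained sketch. This is not the paper's actual RLR estimator, which operates on the full recursive diffusion chain; it only shows the zeroth-order building block: for x ~ N(theta, sigma^2), the identity d/dtheta E[f(x)] = E[f(x) * (x - theta) / sigma^2] lets the gradient be estimated from forward evaluations of f alone, with no backpropagation through f.

```python
import numpy as np

rng = np.random.default_rng(0)

def lr_gradient(f, theta, sigma=0.1, n_samples=200_000):
    """Monte Carlo likelihood-ratio estimate of
    d/dtheta E_{x ~ N(theta, sigma^2)}[f(x)].

    Only forward evaluations of f are used: the score term
    (x - theta) / sigma^2 carries all gradient information.
    """
    x = theta + sigma * rng.standard_normal(n_samples)
    return np.mean(f(x) * (x - theta) / sigma**2)

# Example: f(x) = x^2, so E[f] = theta^2 + sigma^2 and the
# true gradient is 2 * theta.
theta = 1.5
est = lr_gradient(lambda x: x**2, theta)
print(est)  # close to 2 * theta = 3.0
```

The trade-off the report highlights is visible here: the estimator is unbiased, but its variance grows as sigma shrinks, which is why combining it with first-order backpropagation over part of the chain can reduce overall variance.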
Systematic design space analysis and variance minimization framework

The authors characterize the full design space of unbiased gradient estimators for diffusion models and formulate a constrained optimization problem to minimize estimator variance under computational budget constraints. This framework guides the principled design of the RLR optimizer by optimizing the sub-chain length and starting position.

10 retrieved papers
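The kind of constrained variance minimization this contribution describes can be sketched abstractly: choose the backpropagation sub-chain length k (out of T diffusion steps) to minimize total estimator variance subject to a compute budget. The cost and variance models below are invented for illustration only; the paper derives its own from the estimator's structure.

```python
def best_subchain(T, budget, bp_cost=2.0, lr_cost=1.0,
                  bp_var=0.1, lr_var=5.0):
    """Grid-search the sub-chain length k in [0, T] minimizing a toy
    variance model under cost(k) = bp_cost*k + lr_cost*(T-k) <= budget.

    Toy additive model: backprop (BP) steps are expensive but
    low-variance, likelihood-ratio (LR) steps are cheap but
    high-variance.
    """
    best = None
    for k in range(T + 1):
        cost = bp_cost * k + lr_cost * (T - k)
        if cost > budget:
            continue  # infeasible under the compute budget
        var = bp_var * k + lr_var * (T - k)
        if best is None or var < best[1]:
            best = (k, var)
    return best

print(best_subchain(T=50, budget=75))  # -> (25, 127.5)
```

Under these hypothetical costs, the budget caps k at 25, and since each BP step removes more variance than it adds, the optimum sits exactly at that cap; the paper's framework additionally optimizes the sub-chain's starting position, which this sketch omits.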
Diffusive Chain-of-Thought (DCoT) prompt technique

The authors propose a novel prompting technique that decomposes generation prompts into multi-scale levels (coarse, mid, and fine) to align with the coarse-to-fine generation process of diffusion models. This technique leverages the RLR's ability to target specific time steps for gradient updates, enabling focused improvements at particular generation scales.

10 retrieved papers
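A minimal sketch in the spirit of the described Diffusive Chain-of-Thought technique: the prompt is decomposed into coarse/mid/fine levels, and each level is attached to the band of diffusion timesteps where that scale dominates generation (high noise for layout, low noise for detail). The level texts and the even band split are hypothetical choices for illustration.

```python
def dcot_schedule(levels, total_steps=1000):
    """Map prompt levels (coarse -> fine) onto contiguous timestep
    bands, ordered from high noise (coarse) to low noise (fine)."""
    n = len(levels)
    band = total_steps // n
    schedule = []
    for i, text in enumerate(levels):
        hi = total_steps - i * band          # noisier end of the band
        lo = total_steps - (i + 1) * band    # cleaner end of the band
        schedule.append({"prompt": text, "t_range": (lo, hi)})
    return schedule

levels = [
    "a mountain landscape at dusk",          # coarse: global layout
    "a lake in the foreground, pine trees",  # mid: objects and regions
    "ripples on the water, warm rim light",  # fine: texture and detail
]
for entry in dcot_schedule(levels):
    print(entry["t_range"], entry["prompt"])
```

Such a schedule pairs naturally with the report's description of RLR targeting specific timesteps for gradient updates: each prompt level identifies the band where its reward signal should be applied.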

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Recursive Likelihood Ratio (RLR) optimizer for diffusion model fine-tuning

The authors introduce a novel gradient estimation method called RLR that reorganizes the computation graph in diffusion models using a half-order approach. This method combines first-order, half-order, and zeroth-order gradient estimation strategies to achieve unbiased gradient estimation with lower variance compared to existing reinforcement learning and truncated backpropagation methods.

Contribution

Systematic design space analysis and variance minimization framework

The authors characterize the full design space of unbiased gradient estimators for diffusion models and formulate a constrained optimization problem to minimize estimator variance under computational budget constraints. This framework guides the principled design of the RLR optimizer by optimizing the sub-chain length and starting position.

Contribution

Diffusive Chain-of-Thought (DCoT) prompt technique

The authors propose a novel prompting technique that decomposes generation prompts into multi-scale levels (coarse, mid, and fine) to align with the coarse-to-fine generation process of diffusion models. This technique leverages the RLR's ability to target specific time steps for gradient updates, enabling focused improvements at particular generation scales.