Half-order Fine-Tuning for Diffusion Model: A Recursive Likelihood Ratio Optimizer

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: perturbation-based gradient estimation, diffusion model, post-training
Abstract:

The probabilistic diffusion model (DM), which generates content through inference over a recursive chain structure, has emerged as a powerful framework for visual generation. After pre-training on enormous datasets, the model must be properly aligned to meet the requirements of downstream applications, so efficiently aligning the foundation DM is a crucial task. Contemporary methods are based either on Reinforcement Learning (RL) or on truncated Backpropagation (BP). However, RL and truncated BP suffer from low sample efficiency and biased gradient estimation, respectively, resulting in limited improvement or, worse, complete training failure. To overcome these challenges, we propose the Recursive Likelihood Ratio (RLR) optimizer, a Half-Order (HO) fine-tuning paradigm for DMs. The HO gradient estimator enables rearrangement of the computation graph within the recursive diffusion chain, making RLR's gradient estimator unbiased with lower variance than that of other methods. We theoretically investigate the bias, variance, and convergence of our method, and conduct extensive experiments on image and video generation to validate the superiority of RLR. Furthermore, we propose a novel prompting technique that naturally complements RLR, achieving a synergistic effect.

Disclaimer
This report is AI-generated using Large Language Models and WisPaper (a scholarly search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs), and the system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes a Recursive Likelihood Ratio (RLR) optimizer for aligning diffusion models, positioning itself within the Reinforcement Learning-Based Alignment leaf of the taxonomy. This leaf contains five papers total, including the original work, indicating a moderately active research direction. The core contribution addresses gradient estimation challenges in RL-based fine-tuning by introducing a 'Half-Order' paradigm that claims unbiased gradient estimation with lower variance than existing truncated backpropagation or standard RL approaches. The work sits at the intersection of preference-based alignment and computational efficiency concerns.

The taxonomy reveals that Reinforcement Learning-Based Alignment is one of three sibling approaches under Preference-Based Alignment Methods, alongside Direct Preference Optimization (seven papers) and Trajectory and Sampling Optimization (two papers). Direct Preference Optimization represents a more crowded alternative direction that avoids explicit reward models, while the original paper's RL-based approach maintains reward signals but seeks to improve gradient estimation. Neighboring branches like Parameter-Efficient Adaptation (eight papers across two leaves) and Representation and Feature Alignment (seven papers) address orthogonal efficiency concerns through architectural modifications rather than training dynamics, suggesting the field explores multiple complementary strategies for alignment.

Among the thirty candidates examined across the three contributions, none were identified as clearly refuting the proposed methods. The RLR optimizer was compared against ten candidates with zero refutable overlaps, as were the systematic design space analysis and the Diffusive Chain-of-Thought prompt technique. The absence of refutation within this limited search scope suggests either genuine novelty in the specific gradient estimation approach or that the semantic search did not surface closely related variance reduction techniques in RL-based diffusion fine-tuning. The contribution-level statistics indicate consistent novelty signals across all three claimed innovations within the examined candidate set.

Based on the top-thirty semantic matches and taxonomy structure, the work appears to occupy a distinct position within RL-based alignment by focusing on gradient estimator properties rather than reward model design or sampling strategies. However, the limited search scope means potentially relevant work in variance reduction for sequential decision-making or alternative gradient estimation techniques in generative modeling may not have been captured. The taxonomy context suggests this is an active but not overcrowded research direction with clear differentiation from adjacent preference optimization paradigms.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 0

Research Landscape Overview

Core task: efficient fine-tuning of diffusion models for alignment. The field has organized itself around several complementary strategies for adapting large-scale diffusion models to better match human preferences, domain requirements, or specific generation tasks without prohibitive computational costs. Preference-Based Alignment Methods leverage human feedback and reinforcement learning signals to steer model outputs toward desired qualities, often employing reward models or direct preference optimization. Parameter-Efficient Adaptation Techniques focus on minimizing trainable parameters through approaches like low-rank adaptors (IP-Adapter[4]) or structured gating mechanisms, enabling rapid customization with limited resources. Representation and Feature Alignment (Representation Alignment[3]) targets internal model representations to improve semantic consistency, while Domain-Specific Fine-Tuning Applications and Temporal and Video Generation Adaptation (AnimateDiff[5]) address specialized modalities such as medical imaging or coherent video synthesis. Specialized Fine-Tuning Paradigms explore novel training regimes including self-play (Self-Play Fine-Tuning[1]) and fairness constraints (Fairness Fine-Tuning[2]), and Conditional Generation Enhancement refines how models respond to complex or multi-modal conditioning signals. Within this landscape, reinforcement learning-based alignment has emerged as a particularly active area, balancing sample efficiency with the need for stable gradient signals from reward functions. Works like Reward Backpropagation[8] and Latent-Space Surrogate Reward[41] demonstrate contrasting strategies: some backpropagate rewards directly through the diffusion process, while others construct surrogate objectives in latent space to reduce computational overhead. 
The original paper, Recursive Likelihood Ratio[0], situates itself in this reinforcement learning-driven branch, proposing a method that recursively refines alignment signals to improve sample quality under human preferences. Compared to Human-Feedback Efficient[27], which emphasizes minimizing annotation burden, or MIRA[47], which integrates multi-modal reward signals, Recursive Likelihood Ratio[0] focuses on iteratively sharpening the likelihood-based feedback loop. This positioning highlights ongoing tensions between annotation efficiency, computational cost, and the fidelity of alignment to nuanced human judgments.

Claimed Contributions

Recursive Likelihood Ratio (RLR) optimizer for diffusion model fine-tuning

The authors introduce a novel gradient estimation method called RLR that reorganizes the computation graph in diffusion models using a half-order approach. This method combines first-order, half-order, and zeroth-order gradient estimation strategies to achieve unbiased gradient estimation with lower variance compared to existing reinforcement learning and truncated backpropagation methods.

10 retrieved papers
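The likelihood-ratio idea underlying this contribution can be illustrated with a minimal, self-contained sketch. This is not the paper's actual RLR estimator, which operates on the full recursive diffusion chain; it only shows the zeroth-order building block: for x ~ N(theta, sigma^2), the identity d/dtheta E[f(x)] = E[f(x) * (x - theta) / sigma^2] lets the gradient be estimated from forward evaluations of f alone, with no backpropagation through f.

```python
import numpy as np

rng = np.random.default_rng(0)

def lr_gradient(f, theta, sigma=0.1, n_samples=200_000):
    """Monte Carlo likelihood-ratio estimate of
    d/dtheta E_{x ~ N(theta, sigma^2)}[f(x)].

    Only forward evaluations of f are used: the score term
    (x - theta) / sigma^2 carries all gradient information.
    """
    x = theta + sigma * rng.standard_normal(n_samples)
    return np.mean(f(x) * (x - theta) / sigma**2)

# Example: f(x) = x^2, so E[f] = theta^2 + sigma^2 and the
# true gradient is 2 * theta.
theta = 1.5
est = lr_gradient(lambda x: x**2, theta)
print(est)  # close to 2 * theta = 3.0
```

The trade-off the report highlights is visible here: the estimator is unbiased, but its variance grows as sigma shrinks, which is why combining it with first-order backpropagation over part of the chain can reduce overall variance.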
Systematic design space analysis and variance minimization framework

The authors characterize the full design space of unbiased gradient estimators for diffusion models and formulate a constrained optimization problem to minimize estimator variance under computational budget constraints. This framework guides the principled design of the RLR optimizer by optimizing the sub-chain length and starting position.

10 retrieved papers
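The kind of constrained variance minimization this contribution describes can be sketched abstractly: choose the backpropagation sub-chain length k (out of T diffusion steps) to minimize total estimator variance subject to a compute budget. The cost and variance models below are invented for illustration only; the paper derives its own from the estimator's structure.

```python
def best_subchain(T, budget, bp_cost=2.0, lr_cost=1.0,
                  bp_var=0.1, lr_var=5.0):
    """Grid-search the sub-chain length k in [0, T] minimizing a toy
    variance model under cost(k) = bp_cost*k + lr_cost*(T-k) <= budget.

    Toy additive model: backprop (BP) steps are expensive but
    low-variance, likelihood-ratio (LR) steps are cheap but
    high-variance.
    """
    best = None
    for k in range(T + 1):
        cost = bp_cost * k + lr_cost * (T - k)
        if cost > budget:
            continue  # infeasible under the compute budget
        var = bp_var * k + lr_var * (T - k)
        if best is None or var < best[1]:
            best = (k, var)
    return best

print(best_subchain(T=50, budget=75))  # -> (25, 127.5)
```

Under these hypothetical costs, the budget caps k at 25, and since each BP step removes more variance than it adds, the optimum sits exactly at that cap; the paper's framework additionally optimizes the sub-chain's starting position, which this sketch omits.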
Diffusive Chain-of-Thought (DCoT) prompt technique

The authors propose a novel prompting technique that decomposes generation prompts into multi-scale levels (coarse, mid, and fine) to align with the coarse-to-fine generation process of diffusion models. This technique leverages the RLR's ability to target specific time steps for gradient updates, enabling focused improvements at particular generation scales.

10 retrieved papers
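A minimal sketch in the spirit of the described Diffusive Chain-of-Thought technique: the prompt is decomposed into coarse/mid/fine levels, and each level is attached to the band of diffusion timesteps where that scale dominates generation (high noise for layout, low noise for detail). The level texts and the even band split are hypothetical choices for illustration.

```python
def dcot_schedule(levels, total_steps=1000):
    """Map prompt levels (coarse -> fine) onto contiguous timestep
    bands, ordered from high noise (coarse) to low noise (fine)."""
    n = len(levels)
    band = total_steps // n
    schedule = []
    for i, text in enumerate(levels):
        hi = total_steps - i * band          # noisier end of the band
        lo = total_steps - (i + 1) * band    # cleaner end of the band
        schedule.append({"prompt": text, "t_range": (lo, hi)})
    return schedule

levels = [
    "a mountain landscape at dusk",          # coarse: global layout
    "a lake in the foreground, pine trees",  # mid: objects and regions
    "ripples on the water, warm rim light",  # fine: texture and detail
]
for entry in dcot_schedule(levels):
    print(entry["t_range"], entry["prompt"])
```

Such a schedule pairs naturally with the report's description of RLR targeting specific timesteps for gradient updates: each prompt level identifies the band where its reward signal should be applied.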

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Recursive Likelihood Ratio (RLR) optimizer for diffusion model fine-tuning

The authors introduce a novel gradient estimation method called RLR that reorganizes the computation graph in diffusion models using a half-order approach. This method combines first-order, half-order, and zeroth-order gradient estimation strategies to achieve unbiased gradient estimation with lower variance compared to existing reinforcement learning and truncated backpropagation methods.

Contribution

Systematic design space analysis and variance minimization framework

The authors characterize the full design space of unbiased gradient estimators for diffusion models and formulate a constrained optimization problem to minimize estimator variance under computational budget constraints. This framework guides the principled design of the RLR optimizer by optimizing the sub-chain length and starting position.

Contribution

Diffusive Chain-of-Thought (DCoT) prompt technique

The authors propose a novel prompting technique that decomposes generation prompts into multi-scale levels (coarse, mid, and fine) to align with the coarse-to-fine generation process of diffusion models. This technique leverages the RLR's ability to target specific time steps for gradient updates, enabling focused improvements at particular generation scales.