Diffusion Blend: Inference-Time Multi-Preference Alignment for Diffusion Models

ICLR 2026 Conference Submission · Anonymous Authors
Diffusion Model · Reinforcement Learning · Multi-Objective Finetuning
Abstract:

Reinforcement learning (RL) algorithms have been used recently to align diffusion models with downstream objectives such as aesthetic quality and text-image consistency by fine-tuning them to maximize a single reward function under a fixed KL regularization. However, this approach is inherently restrictive in practice, where alignment must balance multiple, often conflicting objectives. Moreover, user preferences vary across prompts, individuals, and deployment contexts, with varying tolerances for deviation from a pre-trained base model. We address the problem of inference-time multi-preference alignment: given a set of basis reward functions and a reference KL regularization strength, can we design a fine-tuning procedure so that, at inference time, the model can generate images aligned with any user-specified linear combination of rewards and regularization strength, without requiring additional fine-tuning? We propose Diffusion Blend, a novel approach to solve inference-time multi-preference alignment by blending backward diffusion processes associated with fine-tuned models, and we instantiate this approach with three algorithms: DB-MPA for multi-reward alignment, DB-KLA for KL regularization control, and DB-MPA-LS for approximating DB-MPA without additional inference cost. Extensive experiments show that Diffusion Blend algorithms consistently outperform relevant baselines and closely match or exceed the performance of individually fine-tuned models, enabling efficient, user-driven alignment at inference time.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes Diffusion Blend, a framework for inference-time multi-preference alignment that enables dynamic composition of multiple reward functions and KL regularization strengths without additional fine-tuning. It resides in the 'Reward-Guided Inference-Time Alignment' leaf, which contains five papers total, including the original work. This leaf sits within the broader 'Inference-Time Alignment Methods' branch, indicating a moderately populated research direction focused on steering generation during sampling rather than through model retraining. The taxonomy reveals this is an active but not overcrowded area, with sibling work exploring related reward-guided and test-time adaptation strategies.

The taxonomy structure shows that Diffusion Blend's leaf is adjacent to 'Test-Time Preference Adaptation' (three papers) within the same parent branch, and neighbors the 'Training-Based Alignment Methods' branch, which includes 'Direct Preference Optimization for Diffusion' (six papers) and 'Multi-Dimensional Preference Alignment' (four papers). The scope notes clarify that inference-time methods like Diffusion Blend differ from training-based approaches by avoiding model weight updates, and from multi-objective optimization frameworks by focusing on reward-guided steering rather than Pareto-optimal solution generation. This positioning suggests the work bridges inference-time flexibility with multi-preference handling, a boundary less densely explored than single-objective training methods.

Among the nine candidates examined, the 'Inference-time multi-preference alignment problem formulation' contribution has two refutable candidates out of eight, indicating that some prior work addresses similar problem settings within the limited search scope. The 'Diffusion Blend framework and algorithms' contribution was compared against one candidate with no refutations, suggesting the specific blending mechanism may be more distinctive. No candidates were examined for the 'Theoretical approximation for control term' contribution, leaving its novelty unassessed by this analysis. These statistics reflect a focused search rather than exhaustive coverage, so the presence of two refutable candidates for the problem formulation does not definitively establish lack of novelty; it signals that overlapping prior work exists among the top semantic matches.

Based on the limited search scope of nine candidates, the work appears to occupy a moderately explored niche within inference-time alignment, with the problem formulation showing some overlap with existing methods but the algorithmic approach potentially more distinctive. The taxonomy context reveals this sits in an active but not saturated research direction, with clear boundaries separating it from training-based and Pareto-optimization approaches. The analysis covers top semantic matches and does not claim exhaustive field coverage.

Taxonomy

Core-task Taxonomy Papers: 45
Claimed Contributions: 3
Contribution Candidate Papers Compared: 9
Refutable Papers: 2

Research Landscape Overview

Core task: inference-time multi-preference alignment for diffusion models. The field addresses how to steer diffusion models toward multiple, potentially conflicting objectives without retraining. The taxonomy organizes work into four main branches. Inference-Time Alignment Methods focus on guiding generation during sampling, often using reward signals or gradient-based steering to balance competing preferences on the fly. Training-Based Alignment Methods instead modify model weights or learn auxiliary modules to encode preferences, trading flexibility for potentially stronger alignment. Multi-Objective Optimization Frameworks provide algorithmic tools—such as Pareto front approximation or scalarization strategies—that can be applied at either training or inference time. Domain-Specific Applications demonstrate these techniques in specialized contexts like molecular design, materials engineering, and image restoration, where multiple design criteria must be satisfied simultaneously.

A particularly active line of work explores reward-guided inference-time steering, where methods like Test-time Alignment[1] and Reward-guided Tutorial[2] adjust sampling trajectories using differentiable reward models. These approaches contrast with training-free multi-objective frameworks such as Training-free Multi-objective[5], which combine multiple objectives without gradient-based guidance. Diffusion Blend[0] sits squarely within the reward-guided inference-time cluster, emphasizing dynamic blending of multiple reward signals during generation. Compared to DyMO[15] and MIRA[16]—neighbors that also tackle multi-preference scenarios—Diffusion Blend[0] focuses on flexible, on-the-fly composition rather than pre-trained preference encodings.

A central open question across these branches is how to efficiently navigate trade-offs among conflicting objectives while maintaining sample quality, especially when the number of preferences scales or when domain-specific constraints arise.

Claimed Contributions

Inference-time multi-preference alignment problem formulation

The authors formalize a new problem where diffusion models must align with arbitrary user-specified linear combinations of multiple reward functions and varying KL regularization strengths at inference time, without additional fine-tuning. This extends beyond standard single-reward alignment to accommodate diverse and dynamic user preferences.
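In standard notation, the formulation described above generalizes the usual KL-regularized reward objective. The following is a sketch using symbols assumed here for illustration (the paper's exact notation is not reproduced in this report): a fine-tuned distribution p_theta, a reference model p_ref, basis rewards r_1, ..., r_K, user weights w_i, and a regularization strength beta chosen at inference time.

```latex
% Single-reward KL-regularized fine-tuning (fixed reward $r$, strength $\beta$):
\max_{p_\theta} \;
  \mathbb{E}_{x \sim p_\theta}\!\left[\, r(x) \,\right]
  - \beta \,\mathrm{KL}\!\left(p_\theta \,\|\, p_{\mathrm{ref}}\right)

% Inference-time multi-preference variant: the user supplies weights
% $w_1,\dots,w_K$ and a strength $\beta$ at sampling time, with no retraining:
\max_{p_\theta} \;
  \mathbb{E}_{x \sim p_\theta}\!\left[\, \sum_{i=1}^{K} w_i \, r_i(x) \,\right]
  - \beta \,\mathrm{KL}\!\left(p_\theta \,\|\, p_{\mathrm{ref}}\right)
```

The difficulty the contribution targets is that the second objective must be solvable for any (w, beta) the user picks, whereas standard fine-tuning bakes a single (r, beta) into the model weights.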

8 retrieved papers
Can Refute
Diffusion Blend framework and algorithms

The authors introduce Diffusion Blend, a principled method that blends backward diffusion trajectories from reward-specific fine-tuned models. They propose three concrete algorithms: DB-MPA enables multi-reward alignment, DB-KLA provides KL regularization control, and DB-MPA-LS achieves similar performance without extra inference overhead.

1 retrieved paper
Theoretical approximation for control term in backward diffusion

The authors derive a theoretical result showing that the backward diffusion for any reward combination can be expressed via a control term, and they propose an approximation that decomposes this term into contributions from basis reward models. This enables blending of fine-tuned models to achieve arbitrary preference alignment without retraining.
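The decomposition described above can be sketched in code. The snippet below is a hypothetical illustration, not the paper's exact formula: it assumes each basis reward has a fine-tuned noise predictor whose deviation from the base model's prediction acts as that reward's approximate control term, and forms the blended prediction as the base prediction plus a user-weighted sum of those deviations.

```python
import numpy as np

def blend_noise_predictions(eps_base, eps_models, weights):
    """Blend per-reward noise predictions at one denoising step.

    eps_base   : prediction of the pre-trained base model at this step
    eps_models : predictions from reward-specific fine-tuned models
    weights    : user-chosen reward weights, one per basis reward

    Each fine-tuned model's deviation from the base prediction serves as
    an approximate control term for its reward; blending sums these
    deviations with the user's weights, so no retraining is needed when
    the weights change.
    """
    eps_base = np.asarray(eps_base, dtype=float)
    eps_blend = eps_base.copy()
    for w, eps_i in zip(weights, eps_models):
        eps_blend += w * (np.asarray(eps_i, dtype=float) - eps_base)
    return eps_blend

# Toy example: two basis rewards, a 4-dimensional stand-in latent.
eps_base = np.zeros(4)
eps_a = np.ones(4)        # stand-in for an aesthetics-tuned model
eps_b = np.full(4, -1.0)  # stand-in for a consistency-tuned model

# Equal weights cancel the two (opposite) control terms in this toy case.
print(blend_noise_predictions(eps_base, [eps_a, eps_b], [0.5, 0.5]))
```

In an actual sampler this blending would be applied at every reverse-diffusion step, with the weights held fixed across the trajectory; the linear form is what lets arbitrary reward combinations be served from a fixed set of basis models.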

0 retrieved papers

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Inference-time multi-preference alignment problem formulation

The authors formalize a new problem where diffusion models must align with arbitrary user-specified linear combinations of multiple reward functions and varying KL regularization strengths at inference time, without additional fine-tuning. This extends beyond standard single-reward alignment to accommodate diverse and dynamic user preferences.

Contribution

Diffusion Blend framework and algorithms

The authors introduce Diffusion Blend, a principled method that blends backward diffusion trajectories from reward-specific fine-tuned models. They propose three concrete algorithms: DB-MPA enables multi-reward alignment, DB-KLA provides KL regularization control, and DB-MPA-LS achieves similar performance without extra inference overhead.

Contribution

Theoretical approximation for control term in backward diffusion

The authors derive a theoretical result showing that the backward diffusion for any reward combination can be expressed via a control term, and they propose an approximation that decomposes this term into contributions from basis reward models. This enables blending of fine-tuned models to achieve arbitrary preference alignment without retraining.
