Sample Reward Soups: Query-efficient Multi-Reward Guidance for Text-to-Image Diffusion Models

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: Diffusion model, Text-to-Image, Sample Reward Soups, Training-free, Black-box alignment
Abstract:

Recent advances in inference-time alignment of diffusion models have shown reduced susceptibility to reward over-optimization. However, when aligning with multiple black-box reward functions, the number of required queries grows exponentially with the number of reward functions, making the alignment process highly inefficient. To address this challenge, we propose the first inference-time soup strategy, named Sample Reward Soups (SRSoup), for Pareto-optimal sampling across the entire space of preferences. Specifically, at each denoising step, we independently steer multiple denoising distributions using reward-guided search gradients (one per reward function) and then linearly interpolate the resulting gradients. This design is effective because sample rewards can be shared when two denoising distributions are close, particularly during the early stages of the denoising process. As a result, SRSoup significantly reduces the number of queries required in the early stages without sacrificing performance. Extensive experiments demonstrate the effectiveness of SRSoup in aligning T2I models with diverse reward functions, establishing a practical and scalable solution.
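The per-step mechanism described in the abstract can be sketched as follows. This is an illustrative rendering, not the authors' implementation: the 2-D "latent", the stand-in reward gradients, and the `step_size` value are all hypothetical, and in the actual method the search gradients would be derived from reward queries over candidate denoising samples.

```python
import numpy as np

def srsoup_step(x_t, reward_grads, weights, step_size=0.1):
    """One SRSoup-style update (illustrative): linearly interpolate the
    per-reward search gradients, then nudge the sample along the mixture."""
    weights = np.asarray(weights, dtype=float)
    mixed = sum(w * g for w, g in zip(weights, reward_grads))
    return x_t + step_size * mixed

# Toy example on a 2-D "latent": two stand-in reward gradients,
# one per reward function, combined with equal preference weights.
x = np.zeros(2)
g_aesthetic = np.array([1.0, 0.0])
g_alignment = np.array([0.0, 1.0])
x_new = srsoup_step(x, [g_aesthetic, g_alignment], weights=[0.5, 0.5])
```

Varying the preference weights changes which reward dominates the mixed direction, which is what lets a single sampler cover the space of preferences.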

Disclaimer
This report is AI-GENERATED using large language models and WisPaper (a scholarly search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces Sample Reward Soups (SRSoup), an inference-time gradient interpolation method for multi-reward alignment in text-to-image diffusion models. According to the taxonomy, it resides in the 'Inference-Time Gradient Interpolation' leaf under 'Reward Interpolation and Composition'. Notably, this leaf contains only the original paper itself, with no sibling papers identified. This suggests the specific approach of interpolating reward-guided search gradients at each denoising step represents a relatively sparse research direction within the broader multi-reward alignment landscape, which encompasses eighteen papers across multiple branches.

The taxonomy reveals that neighboring research directions include 'Preference-Specific Expert Merging' (which trains separate experts and merges them) and 'Pareto-Optimal Multi-Reward Frameworks' (which use Pareto optimality principles). The broader 'Inference-Time Guidance and Search' branch contains methods like 'Reward-Guided Gradient Optimization' and 'Gradient-Free Search and Sampling'. SRSoup's gradient interpolation approach sits at the intersection of multi-objective optimization and inference-time guidance, distinguishing itself from training-time merging strategies and from pure search methods that avoid gradient computation entirely. The taxonomy's scope notes clarify that gradient interpolation methods differ from Pareto formulations and from embedding-based composition techniques.

Among twenty-nine candidates examined, the contribution-level analysis shows mixed novelty signals. The core SRSoup framework (nine candidates examined, zero refutable) and the query-efficient gradient interpolation mechanism (ten candidates examined, zero refutable) appear to have limited direct prior work within the search scope. However, the reward-guided sampling strategy for black-box alignment (ten candidates examined, four refutable) shows more substantial overlap with existing methods. This pattern suggests that while the specific gradient interpolation design may be novel, the underlying principle of using reward gradients to steer diffusion sampling has established precedents in the examined literature.

Based on the limited search scope of twenty-nine semantically similar papers, the work appears to occupy a relatively unexplored niche within inference-time multi-reward alignment. The absence of sibling papers in its taxonomy leaf and the low refutation rate for its core contributions suggest potential novelty, though the analysis does not cover exhaustive citation networks or domain-specific venues. The four refutable instances for reward-guided sampling indicate that foundational techniques are well-established, while the specific gradient interpolation mechanism represents a more distinctive contribution within the examined candidate set.

Taxonomy

Core-task Taxonomy Papers: 18
Claimed Contributions: 3
Contribution Candidate Papers Compared: 29
Refutable Papers: 4

Research Landscape Overview

Core task: Multi-reward alignment for text-to-image diffusion models at inference time. The field addresses how to steer generative models toward multiple desirable properties—such as aesthetic quality, prompt adherence, and safety—without retraining. The taxonomy reveals several complementary branches: Multi-Objective Optimization Strategies explore how to balance competing rewards through interpolation or composition; Reward Modeling and Formulation focuses on designing and learning suitable reward signals, including multimodal or fine-grained feedback; Inference-Time Guidance and Search develops methods that apply reward gradients or search procedures during sampling; Mitigating Reward Over-Optimization tackles the risk of exploiting reward models to the detriment of image quality; and Component-Specific Optimization targets particular modules like text encoders or LoRA adapters. Together, these branches reflect a shift from monolithic training objectives toward flexible, modular alignment at generation time, as illustrated by works like Rewards in Context[1] and Inference Time Alignment Tutorial[4].

Within this landscape, a particularly active line of work examines how to combine reward signals on the fly. Sample Reward Soups[0] sits squarely in the Reward Interpolation and Composition cluster, proposing inference-time gradient interpolation to blend multiple objectives without expensive retraining. This approach contrasts with methods like Dynamic Search[5], which performs explicit search over candidate samples, and Mira[7], which learns to merge reward models in a more structured way. Meanwhile, Dense Reward View[3] and ReNO[6] emphasize richer, step-level feedback to guide the diffusion process more precisely.

The central tension across these directions is balancing computational overhead against alignment fidelity: gradient-based interpolation offers efficiency but may require careful tuning to avoid reward hacking, whereas search-based or learned composition strategies can be more robust yet costlier. Sample Reward Soups[0] thus represents a pragmatic middle ground, leveraging gradient mixing to achieve multi-objective alignment with minimal inference-time complexity.

Claimed Contributions

Sample Reward Soups (SRSoup) for inference-time multi-reward alignment

The authors introduce SRSoup, a training-free method that interpolates reward-guided search gradients from individual reward functions at each denoising step to achieve multi-objective alignment in text-to-image diffusion models. This approach enables Pareto-optimal sampling across different preference weightings without requiring model fine-tuning.

9 retrieved papers
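To illustrate how a single guidance loop can produce Pareto-optimal samples across preference weightings, here is a minimal standalone sketch under toy assumptions: the quadratic "rewards", their closed-form gradients `g1`/`g2`, and the simple iteration are all stand-ins for the paper's actual reward-guided search over denoising distributions.

```python
import numpy as np

def guided_sample(weights, reward_grads, steps=10, lr=0.1):
    """Steer a sample under one fixed preference weighting by repeatedly
    stepping along the weighted mix of per-reward gradients."""
    x = np.zeros(2)
    for _ in range(steps):
        mix = sum(w * g(x) for w, g in zip(weights, reward_grads))
        x = x + lr * mix
    return x

# Two competing toy rewards pulling toward different targets.
g1 = lambda x: np.array([1.0, 0.0]) - x  # grad of -0.5*||x - (1,0)||^2
g2 = lambda x: np.array([0.0, 1.0]) - x  # grad of -0.5*||x - (0,1)||^2

# Sweeping the preference weight traces a family of trade-off samples.
front = [guided_sample([lam, 1.0 - lam], [g1, g2]) for lam in (0.0, 0.5, 1.0)]
```

Sweeping the weight from 0 to 1 moves the final sample between the two reward optima, the kind of trade-off family the Pareto-optimal sampling claim refers to.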
Query-efficient gradient interpolation mechanism

The method steers multiple denoising distributions independently using reward-guided search gradients and linearly interpolates them. This design exploits the observation that sample rewards can be shared when denoising distributions are close, particularly in early denoising stages, significantly reducing the number of required reward queries.

10 retrieved papers
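A minimal sketch of the query-sharing idea, using a distance threshold as a crude proxy for "the denoising distributions are close". The `SharedRewardCache` class, the `tol` parameter, and the toy reward are all hypothetical illustrations, not the authors' implementation.

```python
import numpy as np

class SharedRewardCache:
    """If a candidate sample lies within `tol` of an already-scored one,
    reuse that score instead of issuing a new (expensive) black-box query."""

    def __init__(self, reward_fn, tol=0.5):
        self.reward_fn = reward_fn
        self.tol = tol
        self.keys, self.vals = [], []
        self.queries = 0  # count of true black-box reward calls

    def score(self, x):
        for k, v in zip(self.keys, self.vals):
            if np.linalg.norm(x - k) < self.tol:
                return v  # shared reward, no new query
        self.queries += 1
        v = self.reward_fn(x)
        self.keys.append(np.array(x))
        self.vals.append(v)
        return v

# Early in denoising, candidates from two reward-specific branches nearly
# coincide, so one query serves both; a distant candidate triggers a new query.
cache = SharedRewardCache(lambda x: -float(np.sum(x ** 2)), tol=0.5)
a = cache.score(np.array([1.0, 1.0]))   # real query
b = cache.score(np.array([1.1, 0.9]))   # close enough: shared
c = cache.score(np.array([3.0, 3.0]))   # far: new query
```

The saving mirrors the paper's observation: while the per-reward branches stay close (early steps), one reward query can serve several of them, so total query count drops without changing which samples are preferred.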
Reward-guided sampling strategy for black-box reward alignment

The authors propose a training-free guidance strategy that optimizes the stepwise denoising distributions of diffusion models using reward-guided search gradients derived from black-box reward functions, enabling alignment without requiring differentiable rewards or model fine-tuning.

10 retrieved papers (4 can refute)
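One way to obtain a "search gradient" from a non-differentiable, black-box reward is a zeroth-order estimate built purely from reward queries. The sketch below uses an evolution-strategies-style estimator as a stand-in; the paper's exact search procedure may differ, and the toy reward and all parameter values here are assumptions.

```python
import numpy as np

def search_gradient(x, reward_fn, sigma=0.1, n_samples=200, rng=None):
    """Estimate a search direction from black-box reward queries only:
    perturb the sample, score each perturbation, and average the
    reward-weighted perturbation directions. No differentiability of
    the reward is required, only function evaluations."""
    rng = rng or np.random.default_rng(0)
    eps = rng.standard_normal((n_samples,) + x.shape)
    scores = np.array([reward_fn(x + sigma * e) for e in eps])
    scores = (scores - scores.mean()) / (scores.std() + 1e-8)  # normalize
    flat = eps.reshape(n_samples, -1)
    g = (scores[:, None] * flat).mean(axis=0) / sigma
    return g.reshape(x.shape)

# Black-box reward: higher when closer to target (1, 0); gradients not exposed.
target = np.array([1.0, 0.0])
reward = lambda x: -float(np.sum((x - target) ** 2))
g = search_gradient(np.zeros(2), reward)
```

The estimate points toward the reward optimum even though the reward is only ever evaluated, never differentiated, which is what makes this style of guidance compatible with black-box reward functions.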

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Within the taxonomy built over the current top-K core-task papers, the original paper is assigned to a leaf with no direct siblings and no cousin branches under the same grandparent topic. In this retrieved landscape it appears structurally isolated, which is a partial signal of novelty, though one still constrained by search coverage and taxonomy granularity.

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Sample Reward Soups (SRSoup) for inference-time multi-reward alignment

Contribution

Query-efficient gradient interpolation mechanism

Contribution

Reward-guided sampling strategy for black-box reward alignment
