Sample Reward Soups: Query-efficient Multi-Reward Guidance for Text-to-Image Diffusion Models

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: Diffusion model, Text-to-Image, Sample Reward Soups, Training-free, Black-box alignment
Abstract:

Recent advances in inference-time alignment of diffusion models have shown reduced susceptibility to reward over-optimization. However, when aligning with multiple black-box reward functions, the number of required queries grows exponentially with the number of reward functions, making the alignment process highly inefficient. To address this challenge, we propose the first inference-time soup strategy, named Sample Reward Soups (SRSoup), for Pareto-optimal sampling across the entire space of preferences. Specifically, at each denoising step, we independently steer multiple denoising distributions using reward-guided search gradients (one per reward function) and then linearly interpolate the resulting gradients. This design is effective because sample rewards can be shared when two denoising distributions are close, particularly during the early stages of the denoising process. As a result, SRSoup significantly reduces the number of queries required in the early stages without sacrificing performance. Extensive experiments demonstrate the effectiveness of SRSoup in aligning T2I models with diverse reward functions, establishing a practical and scalable solution.
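The per-step mechanism described in the abstract can be sketched as follows. This is an illustrative rendering, not the authors' implementation: the 2-D "latent", the stand-in reward gradients, and the `step_size` value are all hypothetical, and in the actual method the search gradients would be derived from reward queries over candidate denoising samples.

```python
import numpy as np

def srsoup_step(x_t, reward_grads, weights, step_size=0.1):
    """One SRSoup-style update (illustrative): linearly interpolate the
    per-reward search gradients, then nudge the sample along the mixture."""
    weights = np.asarray(weights, dtype=float)
    mixed = sum(w * g for w, g in zip(weights, reward_grads))
    return x_t + step_size * mixed

# Toy example on a 2-D "latent": two stand-in reward gradients,
# one per reward function, combined with equal preference weights.
x = np.zeros(2)
g_aesthetic = np.array([1.0, 0.0])
g_alignment = np.array([0.0, 1.0])
x_new = srsoup_step(x, [g_aesthetic, g_alignment], weights=[0.5, 0.5])
```

Varying the preference weights changes which reward dominates the mixed direction, which is what lets a single sampler cover the space of preferences.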

Disclaimer
This report is AI-GENERATED using large language models and WisPaper (a scholarly search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces Sample Reward Soups (SRSoup), an inference-time gradient interpolation method for multi-reward alignment in text-to-image diffusion models. According to the taxonomy, it resides in the 'Inference-Time Gradient Interpolation' leaf under 'Reward Interpolation and Composition'. Notably, this leaf contains only the original paper itself, with no sibling papers identified. This suggests the specific approach of interpolating reward-guided search gradients at each denoising step represents a relatively sparse research direction within the broader multi-reward alignment landscape, which encompasses eighteen papers across multiple branches.

The taxonomy reveals that neighboring research directions include 'Preference-Specific Expert Merging' (which trains separate experts and merges them) and 'Pareto-Optimal Multi-Reward Frameworks' (which use Pareto optimality principles). The broader 'Inference-Time Guidance and Search' branch contains methods like 'Reward-Guided Gradient Optimization' and 'Gradient-Free Search and Sampling'. SRSoup's gradient interpolation approach sits at the intersection of multi-objective optimization and inference-time guidance, distinguishing itself from training-time merging strategies and from pure search methods that avoid gradient computation entirely. The taxonomy's scope notes clarify that gradient interpolation methods differ from Pareto formulations and from embedding-based composition techniques.

Among twenty-nine candidates examined, the contribution-level analysis shows mixed novelty signals. The core SRSoup framework (nine candidates examined, zero refutable) and the query-efficient gradient interpolation mechanism (ten candidates examined, zero refutable) appear to have limited direct prior work within the search scope. However, the reward-guided sampling strategy for black-box alignment (ten candidates examined, four refutable) shows more substantial overlap with existing methods. This pattern suggests that while the specific gradient interpolation design may be novel, the underlying principle of using reward gradients to steer diffusion sampling has established precedents in the examined literature.

Based on the limited search scope of twenty-nine semantically similar papers, the work appears to occupy a relatively unexplored niche within inference-time multi-reward alignment. The absence of sibling papers in its taxonomy leaf and the low refutation rate for its core contributions suggest potential novelty, though the analysis does not cover exhaustive citation networks or domain-specific venues. The four refutable instances for reward-guided sampling indicate that foundational techniques are well-established, while the specific gradient interpolation mechanism represents a more distinctive contribution within the examined candidate set.

Taxonomy

Core-task Taxonomy Papers: 18
Claimed Contributions: 3
Contribution Candidate Papers Compared: 29
Refutable Papers: 4

Research Landscape Overview

Core task: Multi-reward alignment for text-to-image diffusion models at inference time. The field addresses how to steer generative models toward multiple desirable properties—such as aesthetic quality, prompt adherence, and safety—without retraining. The taxonomy reveals several complementary branches: Multi-Objective Optimization Strategies explore how to balance competing rewards through interpolation or composition; Reward Modeling and Formulation focuses on designing and learning suitable reward signals, including multimodal or fine-grained feedback; Inference-Time Guidance and Search develops methods that apply reward gradients or search procedures during sampling; Mitigating Reward Over-Optimization tackles the risk of exploiting reward models to the detriment of image quality; and Component-Specific Optimization targets particular modules like text encoders or LoRA adapters. Together, these branches reflect a shift from monolithic training objectives toward flexible, modular alignment at generation time, as illustrated by works like Rewards in Context[1] and Inference Time Alignment Tutorial[4].

Within this landscape, a particularly active line of work examines how to combine reward signals on the fly. Sample Reward Soups[0] sits squarely in the Reward Interpolation and Composition cluster, proposing inference-time gradient interpolation to blend multiple objectives without expensive retraining. This approach contrasts with methods like Dynamic Search[5], which performs explicit search over candidate samples, and Mira[7], which learns to merge reward models in a more structured way. Meanwhile, Dense Reward View[3] and ReNO[6] emphasize richer, step-level feedback to guide the diffusion process more precisely.

The central tension across these directions is balancing computational overhead against alignment fidelity: gradient-based interpolation offers efficiency but may require careful tuning to avoid reward hacking, whereas search-based or learned composition strategies can be more robust yet costlier. Sample Reward Soups[0] thus represents a pragmatic middle ground, leveraging gradient mixing to achieve multi-objective alignment with minimal inference-time complexity.

Claimed Contributions

Sample Reward Soups (SRSoup) for inference-time multi-reward alignment

The authors introduce SRSoup, a training-free method that interpolates reward-guided search gradients from individual reward functions at each denoising step to achieve multi-objective alignment in text-to-image diffusion models. This approach enables Pareto-optimal sampling across different preference weightings without requiring model fine-tuning.

9 retrieved papers
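To illustrate how a single guidance loop can produce Pareto-optimal samples across preference weightings, here is a minimal standalone sketch under toy assumptions: the quadratic "rewards", their closed-form gradients `g1`/`g2`, and the simple iteration are all stand-ins for the paper's actual reward-guided search over denoising distributions.

```python
import numpy as np

def guided_sample(weights, reward_grads, steps=10, lr=0.1):
    """Steer a sample under one fixed preference weighting by repeatedly
    stepping along the weighted mix of per-reward gradients."""
    x = np.zeros(2)
    for _ in range(steps):
        mix = sum(w * g(x) for w, g in zip(weights, reward_grads))
        x = x + lr * mix
    return x

# Two competing toy rewards pulling toward different targets.
g1 = lambda x: np.array([1.0, 0.0]) - x  # grad of -0.5*||x - (1,0)||^2
g2 = lambda x: np.array([0.0, 1.0]) - x  # grad of -0.5*||x - (0,1)||^2

# Sweeping the preference weight traces a family of trade-off samples.
front = [guided_sample([lam, 1.0 - lam], [g1, g2]) for lam in (0.0, 0.5, 1.0)]
```

Sweeping the weight from 0 to 1 moves the final sample between the two reward optima, the kind of trade-off family the Pareto-optimal sampling claim refers to.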
Query-efficient gradient interpolation mechanism

The method steers multiple denoising distributions independently using reward-guided search gradients and linearly interpolates them. This design exploits the observation that sample rewards can be shared when denoising distributions are close, particularly in early denoising stages, significantly reducing the number of required reward queries.

10 retrieved papers
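A minimal sketch of the query-sharing idea, using a distance threshold as a crude proxy for "the denoising distributions are close". The `SharedRewardCache` class, the `tol` parameter, and the toy reward are all hypothetical illustrations, not the authors' implementation.

```python
import numpy as np

class SharedRewardCache:
    """If a candidate sample lies within `tol` of an already-scored one,
    reuse that score instead of issuing a new (expensive) black-box query."""

    def __init__(self, reward_fn, tol=0.5):
        self.reward_fn = reward_fn
        self.tol = tol
        self.keys, self.vals = [], []
        self.queries = 0  # count of true black-box reward calls

    def score(self, x):
        for k, v in zip(self.keys, self.vals):
            if np.linalg.norm(x - k) < self.tol:
                return v  # shared reward, no new query
        self.queries += 1
        v = self.reward_fn(x)
        self.keys.append(np.array(x))
        self.vals.append(v)
        return v

# Early in denoising, candidates from two reward-specific branches nearly
# coincide, so one query serves both; a distant candidate triggers a new query.
cache = SharedRewardCache(lambda x: -float(np.sum(x ** 2)), tol=0.5)
a = cache.score(np.array([1.0, 1.0]))   # real query
b = cache.score(np.array([1.1, 0.9]))   # close enough: shared
c = cache.score(np.array([3.0, 3.0]))   # far: new query
```

The saving mirrors the paper's observation: while the per-reward branches stay close (early steps), one reward query can serve several of them, so total query count drops without changing which samples are preferred.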
Reward-guided sampling strategy for black-box reward alignment

The authors propose a training-free guidance strategy that optimizes the stepwise denoising distributions of diffusion models using reward-guided search gradients derived from black-box reward functions, enabling alignment without requiring differentiable rewards or model fine-tuning.

10 retrieved papers (4 can refute)
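One way to obtain a "search gradient" from a non-differentiable, black-box reward is a zeroth-order estimate built purely from reward queries. The sketch below uses an evolution-strategies-style estimator as a stand-in; the paper's exact search procedure may differ, and the toy reward and all parameter values here are assumptions.

```python
import numpy as np

def search_gradient(x, reward_fn, sigma=0.1, n_samples=200, rng=None):
    """Estimate a search direction from black-box reward queries only:
    perturb the sample, score each perturbation, and average the
    reward-weighted perturbation directions. No differentiability of
    the reward is required, only function evaluations."""
    rng = rng or np.random.default_rng(0)
    eps = rng.standard_normal((n_samples,) + x.shape)
    scores = np.array([reward_fn(x + sigma * e) for e in eps])
    scores = (scores - scores.mean()) / (scores.std() + 1e-8)  # normalize
    flat = eps.reshape(n_samples, -1)
    g = (scores[:, None] * flat).mean(axis=0) / sigma
    return g.reshape(x.shape)

# Black-box reward: higher when closer to target (1, 0); gradients not exposed.
target = np.array([1.0, 0.0])
reward = lambda x: -float(np.sum((x - target) ** 2))
g = search_gradient(np.zeros(2), reward)
```

The estimate points toward the reward optimum even though the reward is only ever evaluated, never differentiated, which is what makes this style of guidance compatible with black-box reward functions.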

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Within the taxonomy built over the current top-K core-task papers, the original paper is assigned to a leaf with no direct siblings and no cousin branches under the same grandparent topic. In this retrieved landscape it appears structurally isolated, which is a partial signal of novelty, though one still constrained by search coverage and taxonomy granularity.

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Sample Reward Soups (SRSoup) for inference-time multi-reward alignment

Contribution

Query-efficient gradient interpolation mechanism

Contribution

Reward-guided sampling strategy for black-box reward alignment
