Towards Better Optimization For Listwise Preference in Diffusion Models

ICLR 2026 Conference Submission, Anonymous Authors
Keywords: Text-to-image generation, Diffusion Model Alignment
Abstract:

Reinforcement learning from human feedback (RLHF) has proven effective for aligning text-to-image (T2I) diffusion models with human preferences. Although Direct Preference Optimization (DPO) is widely adopted for its computational efficiency and avoidance of explicit reward modeling, its application to diffusion models has primarily relied on pairwise preferences; the precise optimization of listwise preferences remains largely unaddressed. In practice, human feedback on image preferences often contains implicit ranking information, which conveys more precise human preferences than pairwise comparisons. In this work, we propose Diffusion-LPO, a simple and effective framework for Listwise Preference Optimization in diffusion models. Given a caption, we aggregate user feedback into a ranked list of images and derive a listwise extension of the DPO objective under the Plackett-Luce model. Diffusion-LPO enforces consistency across the entire ranking by encouraging each sample to be preferred over all of its lower-ranked alternatives. We empirically demonstrate the effectiveness of Diffusion-LPO across a range of tasks, including text-to-image generation, image editing, and personalized preference alignment, where it consistently outperforms pairwise DPO baselines in visual quality and preference alignment.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND ITS JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes Diffusion-LPO, a framework for optimizing text-to-image diffusion models using ranked lists of images rather than pairwise comparisons. It resides in the 'Listwise and Ranking-Based Optimization' leaf, which contains only three papers in total, including this work. Within the broader taxonomy of 29 papers, this is a relatively sparse research direction, suggesting that listwise optimization for diffusion models remains an emerging area with limited prior exploration compared to more established branches such as pairwise methods or reward-based approaches.

The taxonomy reveals that this work sits within the 'Direct Preference Optimization Variants' branch, which also includes sibling categories for pairwise methods, curriculum strategies, and safeguarded optimization. Neighboring branches explore reward model training, classifier guidance, and rich feedback signals. The scope note for this leaf explicitly focuses on 'ranking models' and 'multiple alternatives simultaneously,' distinguishing it from pairwise-only approaches in adjacent leaves. This positioning suggests the paper addresses a gap between simple binary comparisons and more complex multi-signal methods, occupying a middle ground that leverages ranking structure without requiring detailed critiques or editing instructions.

Among the 30 candidates examined, the contribution-level analysis reveals mixed novelty signals. The core Diffusion-LPO framework (10 candidates examined, 0 refutable) appears relatively novel within the limited search scope. However, the listwise extension of DPO under the Plackett-Luce model (10 candidates examined, 3 refutable) overlaps substantially with prior work, indicating that mathematical formulations combining DPO with ranking models have been explored before. The method for constructing listwise preferences from pairwise annotations (10 candidates examined, 0 refutable) appears more distinctive, though the search remains constrained to the top-30 semantic matches.

Based on this limited literature search, the work appears to make incremental contributions to an emerging research direction. The framework-level novelty is clearer than the underlying mathematical formulation, where prior ranking-based DPO variants exist. The analysis covers top-30 semantic candidates and does not claim exhaustive coverage of all relevant prior work in preference optimization or ranking theory more broadly.

Taxonomy

Core-task Taxonomy Papers: 29
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 3

Research Landscape Overview

Core task: Listwise preference optimization for text-to-image diffusion models. The field has organized itself around several complementary branches that together address how to align generative image models with human preferences. The Preference Learning Objectives and Optimization Methods branch explores algorithmic innovations—ranging from direct preference optimization variants like Scalable Ranked Preference[1] and Ranking Implicit Feedback[12] to curriculum-based and collaborative strategies such as Curriculum DPO[11] and Collaborative DPO[9]—that refine how models learn from comparative feedback. Meanwhile, Preference Data Collection and Dataset Construction focuses on gathering high-quality human judgments, exemplified by datasets like Pick-a-Pic[4] and reward models such as ImageReward[3]. Task-Specific and Personalized Preference Alignment investigates tailoring models to individual users or specialized domains, as seen in Personalized Preference Finetuning[5] and Subject-driven RL[25]. The Evaluation and Benchmarking Methods branch develops metrics and protocols to measure alignment quality, while Prompt Engineering and Generation Systems address how to elicit desired outputs through better input design, illustrated by works like DiffusionGPT[7] and Automated Prompt Generation[18].

Within the optimization-methods landscape, a particularly active line of work contrasts pairwise and listwise ranking approaches: while many early methods compare two images at a time, recent efforts explore richer multi-candidate signals to capture nuanced preference structures. Listwise Preference Optimization[0] sits squarely in this emerging direction, emphasizing ranking-based objectives that leverage multiple outputs simultaneously, an approach closely related to Scalable Ranked Preference[1] and Ranking Implicit Feedback[12], which similarly exploit ordered lists rather than binary comparisons. This contrasts with curriculum or collaborative strategies like Curriculum DPO[11], which focus on sample ordering or multi-model coordination rather than listwise loss formulations.

The central trade-off is computational cost versus the expressiveness of the preference signal: listwise methods can capture finer distinctions but must handle larger candidate sets, whereas pairwise techniques remain simpler yet may miss the subtle ranking information that human annotators naturally provide.

Claimed Contributions

Diffusion-LPO framework for listwise preference optimization

The authors introduce Diffusion-LPO, a framework that extends Direct Preference Optimization to handle ranked lists of images rather than just pairwise comparisons. It uses the Plackett-Luce model to enforce consistency across entire rankings, encouraging each sample to be preferred over all lower-ranked alternatives.

10 retrieved papers (0 refutable)
Listwise extension of DPO objective under Plackett-Luce model

The authors derive a new training objective that generalizes the pairwise DPO loss to listwise rankings by modeling preferences with the Plackett-Luce probabilistic ranking model, which captures the full relative ordering within preference lists.

10 retrieved papers (3 refutable)
Can Refute
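To make the claimed objective concrete, the sketch below computes a Plackett-Luce negative log-likelihood over a ranked list of scalar "rewards" ordered best-first. In a DPO-style setup each reward would be beta times the policy-to-reference log-likelihood ratio for one image; that parameterization, and the function itself, are illustrative assumptions rather than the paper's exact derivation.

```python
import math

def plackett_luce_nll(rewards):
    """Negative log-likelihood of a ranked list under the Plackett-Luce model.

    `rewards` is ordered best-first. In a DPO-style setup each entry would be
    beta * log(pi_theta(x|c) / pi_ref(x|c)) for one image -- an illustrative
    parameterization, not necessarily the paper's exact one.
    """
    nll = 0.0
    for k in range(len(rewards) - 1):
        # P(item k beats every remaining item) is a softmax over the suffix:
        #   exp(r_k) / sum_{j >= k} exp(r_j)
        suffix = rewards[k:]
        m = max(suffix)  # stabilize the log-sum-exp
        log_z = m + math.log(sum(math.exp(r - m) for r in suffix))
        nll -= rewards[k] - log_z
    return nll

# With exactly two items, the Plackett-Luce loss collapses to the familiar
# pairwise DPO logistic loss: -log sigmoid(r_w - r_l).
```

This cascade structure is what lets a single listwise loss encourage each sample to be preferred over all lower-ranked alternatives, and it explains why the pairwise DPO loss falls out as the two-item special case.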
Method for constructing listwise preferences from pairwise annotations

The authors present a method to extract implicit ranking information from existing pairwise preference datasets by aggregating transitive preference relations into ranked lists, revealing that 56% of annotations in Pick-a-Pic can form rankings larger than pairs.

10 retrieved papers (0 refutable)
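The kind of aggregation this contribution describes can be illustrated with a minimal sketch: chain pairwise (winner, loser) annotations for one caption into a single ranking via transitivity, here implemented as a topological sort. The function name, the deterministic tie-breaking, and the cycle-handling policy are assumptions; the paper's exact construction procedure is not reproduced here.

```python
from collections import defaultdict

def pairs_to_ranking(pairs):
    """Chain pairwise (winner, loser) annotations into one ranked list.

    Simplified sketch: topologically sorts the preference graph and returns
    the best-first ordering when the graph is acyclic; returns None on a
    preference cycle, which real annotation data may require resolving.
    """
    better_than = defaultdict(set)
    items = set()
    for winner, loser in pairs:
        better_than[winner].add(loser)
        items.update((winner, loser))
    # Kahn's algorithm: repeatedly emit an item that nothing is preferred over.
    indegree = {item: 0 for item in items}
    for winner in better_than:
        for loser in better_than[winner]:
            indegree[loser] += 1
    order = []
    ready = sorted(item for item in items if indegree[item] == 0)
    while ready:
        node = ready.pop(0)
        order.append(node)
        for loser in sorted(better_than[node]):
            indegree[loser] -= 1
            if indegree[loser] == 0:
                ready.append(loser)
    return order if len(order) == len(items) else None

# Two annotations a > b and b > c combine into the 3-way ranking [a, b, c].
```

A longer list extracted this way carries strictly more ordering information than the original pairs, which is what makes the reported 56% figure for Pick-a-Pic consequential for listwise training.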

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution 1

Diffusion-LPO framework for listwise preference optimization

The authors introduce Diffusion-LPO, a framework that extends Direct Preference Optimization to handle ranked lists of images rather than just pairwise comparisons. It uses the Plackett-Luce model to enforce consistency across entire rankings, encouraging each sample to be preferred over all lower-ranked alternatives.

Contribution 2

Listwise extension of DPO objective under Plackett-Luce model

The authors derive a new training objective that generalizes the pairwise DPO loss to listwise rankings by modeling preferences with the Plackett-Luce probabilistic ranking model, which captures the full relative ordering within preference lists.

Contribution 3

Method for constructing listwise preferences from pairwise annotations

The authors present a method to extract implicit ranking information from existing pairwise preference datasets by aggregating transitive preference relations into ranked lists, revealing that 56% of annotations in Pick-a-Pic can form rankings larger than pairs.