ViPO: Visual Preference Optimization at Scale

ICLR 2026 Conference Submission · Anonymous Authors
Diffusion Model, Image Generation, Video Generation, Visual Generation, DPO
Abstract:

While preference optimization is crucial for improving visual generative models, how to effectively scale this paradigm for visual generation remains largely unexplored. Current open-source preference datasets typically contain substantial conflicting preference patterns, where winners excel in some dimensions but underperform in others. Naively optimizing on such noisy datasets fails to learn meaningful preferences, fundamentally hindering effective scaling. To enhance the robustness of preference algorithms against noise, we propose Poly-DPO, which extends the DPO objective with an additional polynomial term that dynamically adjusts model confidence during training based on dataset characteristics, enabling effective learning across diverse data distributions, from noisy to trivially simple patterns. Beyond biased patterns, existing datasets suffer from low resolution, limited prompt diversity, and imbalanced distributions. To facilitate large-scale visual preference optimization by tackling these key data bottlenecks, we construct ViPO, a massive-scale preference dataset with 1M image pairs (1024px) across five categories and 300K video pairs (720p+) across three categories. Leveraging state-of-the-art generative models and diverse prompts, ViPO provides consistent, reliable preference signals with balanced distributions. Remarkably, when applying Poly-DPO to our high-quality dataset, the optimal configuration converges to standard DPO. This convergence validates both our dataset quality and Poly-DPO's adaptive nature: sophisticated optimization becomes unnecessary with sufficient data quality, yet remains valuable for imperfect datasets. We comprehensively validate our approach across various visual generation models. On noisy datasets like Pick-a-Pic V2, Poly-DPO achieves 6.87% and 2.32% gains over Diffusion-DPO on GenEval for SD1.5 and SDXL, respectively. On our high-quality ViPO dataset, models achieve performance far exceeding that of models trained on existing open-source preference datasets. These results confirm that addressing both algorithmic adaptability and data quality is essential for scaling visual preference optimization. All models and datasets will be released.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes Poly-DPO, a polynomial extension to the DPO objective designed to handle noisy preference data, and introduces ViPO, a large-scale preference dataset comprising 1M image pairs and 300K video pairs. It resides in the Direct Preference Optimization Extensions leaf, which contains four papers including Diffusion DPO, DRAGON, and methods addressing multi-preference handling. This leaf sits within the broader Preference Optimization Algorithms and Objectives branch, indicating a moderately active research direction focused on adapting DPO-style frameworks to visual generation without explicit reward modeling.

The taxonomy reveals neighboring leaves addressing related challenges: Reinforcement Learning for Visual Generation explores policy-based methods, Multi-Reward and Multi-Objective Optimization tackles balancing multiple signals, and Hierarchical and Granular Preference Alignment organizes preferences across levels. The Preference Data Construction and Curation branch, particularly Synthetic and Automated Preference Data Generation, addresses dataset quality issues similar to ViPO's motivation. The scope notes clarify that this leaf excludes RL-based and reward-centric approaches, positioning the work as a direct optimization method rather than a policy gradient or reward model design contribution.

Of the 21 candidates examined in total, only 1 was compared against the Poly-DPO algorithm, yielding no clear refutation (0 refutable), which suggests limited prior work on polynomial confidence adjustments in DPO. The ViPO dataset contribution was compared against 10 candidates with 1 refutable match, indicating some overlap in large-scale preference data construction. The insight on conflicting preference patterns was compared against 10 candidates with no refutations, suggesting this framing may be relatively novel. The limited search scope means these findings reflect top-K semantic matches rather than exhaustive coverage of the field.

Based on the top-21 semantic matches examined, the algorithmic contribution appears less explored while the dataset contribution encounters more substantial prior work. The taxonomy structure shows this research direction is neither overcrowded nor sparse, with four sibling papers addressing related DPO extensions. The analysis captures immediate neighbors but does not cover the full landscape of visual preference optimization methods across all eight major branches.

Taxonomy

Core-task taxonomy papers: 50
Claimed contributions: 3
Contribution candidate papers compared: 21
Refutable papers: 1

Research Landscape Overview

Core task: Scaling visual preference optimization for generative models.

The field has evolved into a rich ecosystem organized around eight major branches. Preference Optimization Algorithms and Objectives explores direct extensions of methods like DPO and IPO, adapting them to visual domains through techniques such as Diffusion DPO[12] and DRAGON[27]. Reward Models and Evaluation Metrics focuses on building robust scoring functions, exemplified by ImageReward[9] and VisionReward[19], to guide model training. Preference Data Construction and Curation addresses the challenge of obtaining high-quality human feedback, while Personalization and User-Specific Adaptation tailors outputs to individual tastes. Reasoning and Prompt Enhancement for Generation leverages chain-of-thought and prompt refinement strategies to improve generation quality. Domain-Specific and Task-Specific Applications targets specialized use cases, Efficiency and Scalability Enhancements tackles computational bottlenecks, and Multimodal Integration and Cross-Modal Alignment bridges vision and language modalities.

Within the Preference Optimization Algorithms and Objectives branch, a particularly active line of work centers on direct preference optimization extensions that bypass explicit reward modeling. ViPO[0] sits squarely in this cluster, emphasizing scalable training regimes for visual generative models. It shares conceptual ground with Diffusion DPO[12], which adapts preference learning to diffusion processes, and DRAGON[27], which explores alternative formulations for aligning image generators. Nearby works like Calibrated Multi-Preference[1] and Perpo[3] investigate how to handle diverse or conflicting preference signals, while CHiP[5] introduces hierarchical structures for finer-grained control. The central tension across these methods involves balancing sample efficiency, computational cost, and the ability to capture nuanced human judgments without overfitting to narrow preference distributions.

Claimed Contributions

Poly-DPO optimization algorithm

The authors introduce Poly-DPO, an extension of Diffusion-DPO that adds a polynomial term to dynamically adjust sample weighting based on prediction confidence. This enables effective learning across diverse data distributions, from noisy datasets with conflicting preference patterns to trivially simple patterns.

1 retrieved paper
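The report does not give Poly-DPO's exact objective, but the behavior it describes (a polynomial factor that modulates each pair's influence as a function of the model's current preference confidence, reducing to standard DPO in the optimal configuration on clean data) can be sketched as below. The `alpha` exponent, the stop-gradient on the weight, and the `poly_dpo_loss` name are all illustrative assumptions, not the paper's actual formulation.

```python
import torch
import torch.nn.functional as F

def poly_dpo_loss(margin: torch.Tensor, beta: float = 0.1, alpha: float = 0.0) -> torch.Tensor:
    """Hypothetical Poly-DPO-style loss over a batch of preference pairs.

    `margin` is the implicit reward margin between the preferred and
    dispreferred samples; for Diffusion-DPO it is the difference of
    denoising-error gaps measured against a frozen reference model.
    A polynomial confidence weight p**alpha scales the per-pair DPO loss:
    with alpha > 0, pairs the model is unsure about contribute less
    gradient, damping noisy or conflicting labels. alpha = 0 recovers
    standard DPO, consistent with the report's observation that the
    optimal configuration on clean data converges to plain DPO.
    """
    p = torch.sigmoid(beta * margin)   # model's current preference confidence
    weight = p.detach() ** alpha       # polynomial weight; stop-grad keeps it a pure scale
    return -(weight * F.logsigmoid(beta * margin)).mean()

# Toy usage: random margins for a batch of 8 pairs.
margin = torch.randn(8)
loss_noisy_regime = poly_dpo_loss(margin, beta=0.1, alpha=2.0)  # down-weight uncertain pairs
loss_standard_dpo = poly_dpo_loss(margin, beta=0.1, alpha=0.0)  # plain DPO
```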
ViPO large-scale visual preference dataset

The authors construct ViPO, a large-scale high-quality preference dataset containing 1M high-resolution image pairs across five quality dimensions and 300K video pairs across three categories. The dataset uses state-of-the-art generative models and systematic categorization to provide reliable and balanced preference signals.

10 retrieved papers (1 refutable match)
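To make the dataset description concrete, here is a minimal sketch of what one ViPO-style preference record might look like. The field names, category strings, and the `PreferencePair` class are hypothetical; the report specifies only the pair counts, resolutions, and number of categories, not the actual schema.

```python
from dataclasses import dataclass

@dataclass
class PreferencePair:
    """One hypothetical ViPO-style preference record."""
    prompt: str          # shared text prompt for both samples
    winner_path: str     # preferred sample: 1024px image or 720p+ video
    loser_path: str      # dispreferred sample generated for the same prompt
    modality: str        # "image" (five categories) or "video" (three categories)
    category: str        # quality dimension the pair was curated under
    winner_model: str    # state-of-the-art generator behind the winner
    loser_model: str     # generator behind the loser

pair = PreferencePair(
    prompt="a red bicycle leaning against a brick wall",
    winner_path="images/000001_win.png",
    loser_path="images/000001_lose.png",
    modality="image",
    category="aesthetics",   # placeholder; category names are not given in the report
    winner_model="model_a",  # placeholder identifiers
    loser_model="model_b",
)
```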
Insight on conflicting preference patterns as scaling bottleneck

The authors identify that conflicting preference patterns in existing datasets, where winner images excel in some dimensions but underperform in others, represent a fundamental obstacle to scaling visual preference optimization. They show that naively optimizing on such noisy datasets fails to learn meaningful preferences.

10 retrieved papers
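The conflict the authors describe can be made precise with per-dimension quality scores: a pair is conflicting when the labeled winner beats the loser on at least one dimension but loses on another. The check below is an illustrative reconstruction under that reading; the paper's exact criterion and scoring dimensions are not spelled out in this report.

```python
def is_conflicting(winner_scores: dict[str, float], loser_scores: dict[str, float]) -> bool:
    """Flag a preference pair whose labeled winner is not dominant on every dimension."""
    diffs = [winner_scores[d] - loser_scores[d] for d in winner_scores]
    return any(x > 0 for x in diffs) and any(x < 0 for x in diffs)

# The winner has better aesthetics but worse prompt alignment: a conflicting pair.
winner = {"aesthetics": 0.9, "alignment": 0.4}
loser = {"aesthetics": 0.6, "alignment": 0.7}
assert is_conflicting(winner, loser)
```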

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution: Poly-DPO optimization algorithm

Contribution: ViPO large-scale visual preference dataset

Contribution: Insight on conflicting preference patterns as scaling bottleneck