Omni-Reward: Towards Generalist Omni-Modal Reward Modeling with Free-Form Preferences

ICLR 2026 Conference Submission (Anonymous Authors)
Keywords: Omni-Modal Models, Reward Models, Alignment
Abstract:

Reward models (RMs) play a critical role in aligning AI behaviors with human preferences, yet they face two fundamental challenges: (1) Modality Imbalance, where most RMs focus mainly on the text and image modalities, offering limited support for video, audio, and others; and (2) Preference Rigidity, where training on fixed binary preference pairs fails to capture the complexity and diversity of personalized preferences. To address these challenges, we propose Omni-Reward, a step toward generalist omni-modal reward modeling with support for free-form preferences, consisting of: (1) Evaluation: We introduce Omni-RewardBench, the first omni-modal RM benchmark with free-form preferences, covering nine tasks across five modalities: text, image, video, audio, and 3D; (2) Data: We construct Omni-RewardData, a multimodal preference dataset comprising 248K general preference pairs and 69K instruction-tuning pairs for training generalist omni-modal RMs; (3) Model: We propose Omni-RewardModel, which includes both discriminative and generative RMs, and achieves strong performance on Omni-RewardBench as well as other widely used reward modeling benchmarks.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Taxonomy

Core-task Taxonomy Papers: 41
Claimed Contributions: 3
Contribution Candidate Papers Compared: 29
Refutable Papers: 1

Research Landscape Overview

Core task: omni-modal reward modeling with free-form preferences. The field centers on learning reward functions that can evaluate agent outputs across diverse modalities—text, images, video, and structured data—using human or synthetic preference signals that are not constrained to rigid formats. The taxonomy reveals several major branches: Reward Model Architecture and Training Methodology explores how to build and train these models, including generalist frameworks that unify multiple modalities (e.g., Omni-Reward[0], InternLM-XComposer Reward[12], Unified Reward Multimodal[19]); Preference Optimization and Alignment Methods focuses on techniques like direct preference optimization and modality-balancing strategies (Modality-balancing Preference[4], mDPO[10]); Application Domains and Task-Specific Reward Modeling addresses specialized settings such as vision-language tasks, mathematical reasoning, and agentic environments (Visionreward[1], Agent-RewardBench[18]); Evaluation Benchmarks and Datasets provides standardized testbeds (Multimodal Rewardbench[6]); Surveys and Comparative Studies synthesize progress (Aligning Multimodal Survey[16]); and Cognitive and Behavioral Studies examine how humans process multimodal rewards (Multimodal Rewards Rankings[3], Multimodal Reward Cues[40]).

A particularly active line of work involves generalist reward models that handle multiple modalities within a single architecture, aiming to capture cross-modal dependencies and avoid unimodal biases (Unimodal Spurious Correlations[22]). These models often leverage large-scale preference data and sophisticated training recipes to balance modality-specific signals.

Omni-Reward[0] sits squarely in this generalist branch, emphasizing free-form preference inputs and broad applicability across tasks. It shares conceptual ground with InternLM-XComposer Reward[12] and Unified Reward Multimodal[19], which also pursue unified architectures, but differs in its explicit focus on flexible, unstructured preference formats rather than fixed rubrics (contrast Rubrics as Rewards[5]) or domain-specific tuning (BaseReward[29], Skywork-VL Reward[35]). A key open question across these efforts is how to efficiently scale generalist models while maintaining fine-grained sensitivity to task-specific nuances and avoiding the pitfalls of modality imbalance or spurious correlations.

Claimed Contributions

Omni-RewardBench: first omni-modal RM benchmark with free-form preferences

The authors introduce a comprehensive benchmark for evaluating reward models across five modalities (text, image, video, audio, 3D) covering nine tasks with 3,725 human-annotated preference pairs. The benchmark uniquely incorporates free-form preference descriptions rather than fixed binary preferences, enabling evaluation of RMs under diverse user-specified criteria.
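To make the evaluation setup concrete, the sketch below shows what a single benchmark item with a free-form preference criterion might look like, along with the pairwise accuracy an RM would be scored on. The field names (`modality`, `criterion`, `human_choice`, etc.) are illustrative assumptions, not the benchmark's actual schema.

```python
from dataclasses import dataclass

@dataclass
class PreferenceItem:
    """One hypothetical benchmark item (schema is an assumption)."""
    modality: str      # e.g. "text", "image", "video", "audio", "3d"
    task: str          # one of the nine benchmark tasks
    prompt: str        # instruction given to the evaluated model
    criterion: str     # free-form, user-specified preference description
    response_a: str    # candidate output A (or a file path for non-text modalities)
    response_b: str    # candidate output B
    human_choice: str  # "a" or "b", the annotated preference

def pairwise_accuracy(items, rm_choices):
    """Fraction of items where the RM's pick matches the human annotation."""
    correct = sum(1 for item, pick in zip(items, rm_choices)
                  if pick == item.human_choice)
    return correct / len(items)
```

Under this framing, an RM is queried once per item with the prompt, both candidates, and the free-form criterion, and its "a"/"b" verdicts are compared against the human labels.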

Retrieved papers compared: 10
Omni-RewardData: multimodal preference dataset with instruction-tuning pairs

The authors build a large-scale multimodal preference dataset that combines general preference pairs from existing sources with newly collected instruction-tuning data. This dataset enables reward models to generalize across modalities and dynamically align with diverse user preferences expressed in natural language.
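One way such a dataset can serve both discriminative and generative training is by rendering each raw preference pair into an instruction-tuning example for a judge model. The converter below is a minimal sketch of that idea; the prompt template and output format are assumptions, not the paper's actual recipe.

```python
def to_judge_example(prompt, chosen, rejected, criterion):
    """Turn a (prompt, chosen, rejected, criterion) preference pair into an
    instruction-tuning example asking a generative judge to pick the winner.
    The chosen response is placed in slot A, so the target verdict is "A"."""
    instruction = (
        f"Instruction: {prompt}\n"
        f"Preference criterion: {criterion}\n"
        f"Response A: {chosen}\n"
        f"Response B: {rejected}\n"
        "Which response better satisfies the criterion? Answer 'A' or 'B'."
    )
    return {"input": instruction, "target": "A"}
```

A real pipeline would also randomize which slot holds the chosen response to avoid position bias; that step is omitted here for brevity.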

Retrieved papers compared: 10
Status: Can Refute
Omni-RewardModel: discriminative and generative omni-modal reward models

The authors develop two types of reward models: a discriminative model trained with Bradley-Terry loss and a generative model trained with reinforcement learning that produces explicit reasoning. These models demonstrate significant improvements on the proposed benchmark and achieve performance comparable to or exceeding state-of-the-art on public benchmarks.
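The Bradley-Terry objective mentioned above reduces, for a single pair, to the negative log-likelihood that the chosen response outscores the rejected one. A minimal pure-Python sketch (not the authors' implementation, which would operate on batched model logits):

```python
import math

def bradley_terry_loss(reward_chosen, reward_rejected):
    """Per-pair Bradley-Terry loss: -log sigmoid(r_chosen - r_rejected).
    Computed as log(1 + exp(-margin)) via log1p for numerical stability
    when the margin is large and positive."""
    margin = reward_chosen - reward_rejected
    return math.log1p(math.exp(-margin))
```

At a margin of zero the loss equals ln 2 (the model is indifferent), and it decreases monotonically as the chosen response's reward pulls ahead, which is what drives the discriminative RM to separate preferred from rejected outputs.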

Retrieved papers compared: 9

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution
