Omni-Reward: Towards Generalist Omni-Modal Reward Modeling with Free-Form Preferences

ICLR 2026 Conference Submission (Anonymous Authors)
Keywords: Omni-Modal Models, Reward Models, Alignment
Abstract:

Reward models (RMs) play a critical role in aligning AI behaviors with human preferences, yet they face two fundamental challenges: (1) Modality Imbalance, where most RMs focus mainly on the text and image modalities, offering limited support for video, audio, and others; and (2) Preference Rigidity, where training on fixed binary preference pairs fails to capture the complexity and diversity of personalized preferences. To address these challenges, we propose Omni-Reward, a step toward generalist omni-modal reward modeling with support for free-form preferences, consisting of: (1) Evaluation: We introduce Omni-RewardBench, the first omni-modal RM benchmark with free-form preferences, covering nine tasks across five modalities: text, image, video, audio, and 3D; (2) Data: We construct Omni-RewardData, a multimodal preference dataset comprising 248K general preference pairs and 69K instruction-tuning pairs for training generalist omni-modal RMs; (3) Model: We propose Omni-RewardModel, which includes both discriminative and generative RMs, and achieves strong performance on Omni-RewardBench as well as other widely used reward modeling benchmarks.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Taxonomy

Core-task Taxonomy Papers: 41
Claimed Contributions: 3
Contribution Candidate Papers Compared: 29
Refutable Papers: 1

Research Landscape Overview

Core task: omni-modal reward modeling with free-form preferences. The field centers on learning reward functions that can evaluate agent outputs across diverse modalities—text, images, video, and structured data—using human or synthetic preference signals that are not constrained to rigid formats. The taxonomy reveals several major branches: Reward Model Architecture and Training Methodology explores how to build and train these models, including generalist frameworks that unify multiple modalities (e.g., Omni-Reward[0], InternLM-XComposer Reward[12], Unified Reward Multimodal[19]); Preference Optimization and Alignment Methods focuses on techniques like direct preference optimization and modality-balancing strategies (Modality-balancing Preference[4], mDPO[10]); Application Domains and Task-Specific Reward Modeling addresses specialized settings such as vision-language tasks, mathematical reasoning, and agentic environments (Visionreward[1], Agent-RewardBench[18]); Evaluation Benchmarks and Datasets provides standardized testbeds (Multimodal Rewardbench[6]); Surveys and Comparative Studies synthesize progress (Aligning Multimodal Survey[16]); and Cognitive and Behavioral Studies examine how humans process multimodal rewards (Multimodal Rewards Rankings[3], Multimodal Reward Cues[40]).

A particularly active line of work involves generalist reward models that handle multiple modalities within a single architecture, aiming to capture cross-modal dependencies and avoid unimodal biases (Unimodal Spurious Correlations[22]). These models often leverage large-scale preference data and sophisticated training recipes to balance modality-specific signals.

Omni-Reward[0] sits squarely in this generalist branch, emphasizing free-form preference inputs and broad applicability across tasks. It shares conceptual ground with InternLM-XComposer Reward[12] and Unified Reward Multimodal[19], which also pursue unified architectures, but differs in its explicit focus on flexible, unstructured preference formats rather than fixed rubrics (contrast Rubrics as Rewards[5]) or domain-specific tuning (BaseReward[29], Skywork-VL Reward[35]). A key open question across these efforts is how to efficiently scale generalist models while maintaining fine-grained sensitivity to task-specific nuances and avoiding the pitfalls of modality imbalance or spurious correlations.

Claimed Contributions

Omni-RewardBench: first omni-modal RM benchmark with free-form preferences

The authors introduce a comprehensive benchmark for evaluating reward models across five modalities (text, image, video, audio, 3D) covering nine tasks with 3,725 human-annotated preference pairs. The benchmark uniquely incorporates free-form preference descriptions rather than fixed binary preferences, enabling evaluation of RMs under diverse user-specified criteria.
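To make the evaluation setup concrete, the sketch below shows what a single benchmark item with a free-form preference criterion might look like, along with the pairwise accuracy an RM would be scored on. The field names (`modality`, `criterion`, `human_choice`, etc.) are illustrative assumptions, not the benchmark's actual schema.

```python
from dataclasses import dataclass

@dataclass
class PreferenceItem:
    """One hypothetical benchmark item (schema is an assumption)."""
    modality: str      # e.g. "text", "image", "video", "audio", "3d"
    task: str          # one of the nine benchmark tasks
    prompt: str        # instruction given to the evaluated model
    criterion: str     # free-form, user-specified preference description
    response_a: str    # candidate output A (or a file path for non-text modalities)
    response_b: str    # candidate output B
    human_choice: str  # "a" or "b", the annotated preference

def pairwise_accuracy(items, rm_choices):
    """Fraction of items where the RM's pick matches the human annotation."""
    correct = sum(1 for item, pick in zip(items, rm_choices)
                  if pick == item.human_choice)
    return correct / len(items)
```

Under this framing, an RM is queried once per item with the prompt, both candidates, and the free-form criterion, and its "a"/"b" verdicts are compared against the human labels.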

Retrieved papers compared: 10
Omni-RewardData: multimodal preference dataset with instruction-tuning pairs

The authors build a large-scale multimodal preference dataset that combines general preference pairs from existing sources with newly collected instruction-tuning data. This dataset enables reward models to generalize across modalities and dynamically align with diverse user preferences expressed in natural language.
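One way such a dataset can serve both discriminative and generative training is by rendering each raw preference pair into an instruction-tuning example for a judge model. The converter below is a minimal sketch of that idea; the prompt template and output format are assumptions, not the paper's actual recipe.

```python
def to_judge_example(prompt, chosen, rejected, criterion):
    """Turn a (prompt, chosen, rejected, criterion) preference pair into an
    instruction-tuning example asking a generative judge to pick the winner.
    The chosen response is placed in slot A, so the target verdict is "A"."""
    instruction = (
        f"Instruction: {prompt}\n"
        f"Preference criterion: {criterion}\n"
        f"Response A: {chosen}\n"
        f"Response B: {rejected}\n"
        "Which response better satisfies the criterion? Answer 'A' or 'B'."
    )
    return {"input": instruction, "target": "A"}
```

A real pipeline would also randomize which slot holds the chosen response to avoid position bias; that step is omitted here for brevity.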

Retrieved papers compared: 10
Status: Can Refute
Omni-RewardModel: discriminative and generative omni-modal reward models

The authors develop two types of reward models: a discriminative model trained with Bradley-Terry loss and a generative model trained with reinforcement learning that produces explicit reasoning. These models demonstrate significant improvements on the proposed benchmark and achieve performance comparable to or exceeding state-of-the-art on public benchmarks.
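The Bradley-Terry objective mentioned above reduces, for a single pair, to the negative log-likelihood that the chosen response outscores the rejected one. A minimal pure-Python sketch (not the authors' implementation, which would operate on batched model logits):

```python
import math

def bradley_terry_loss(reward_chosen, reward_rejected):
    """Per-pair Bradley-Terry loss: -log sigmoid(r_chosen - r_rejected).
    Computed as log(1 + exp(-margin)) via log1p for numerical stability
    when the margin is large and positive."""
    margin = reward_chosen - reward_rejected
    return math.log1p(math.exp(-margin))
```

At a margin of zero the loss equals ln 2 (the model is indifferent), and it decreases monotonically as the chosen response's reward pulls ahead, which is what drives the discriminative RM to separate preferred from rejected outputs.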

Retrieved papers compared: 9

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution
