Fair in Mind, Fair in Action? A Synchronous Benchmark for Understanding and Generation in UMLLMs

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: AI Fairness, Unified Multimodal Large Language Models (UMLLMs), Fairness Benchmark, Cross-Task Evaluation, Bias Measurement
Abstract:

As artificial intelligence (AI) permeates society, ensuring fairness has become a foundational challenge. However, the field faces a “Babel Tower” dilemma: fairness metrics abound, yet their underlying philosophical assumptions often conflict, hindering unified paradigms—particularly in unified multimodal large language models (UMLLMs), where biases propagate systemically across tasks. To address this, we introduce the IRIS Benchmark, to our knowledge the first benchmark designed to synchronously evaluate the fairness of both understanding and generation in UMLLMs. Enabled by our high-fidelity demographic classifier, ARES, and four supporting large-scale datasets, the benchmark normalizes and aggregates arbitrary metrics into a high-dimensional “fairness space”, integrating 60 granular metrics across three dimensions—Ideal Fairness, Real-world Fidelity, and Bias Inertia & Steerability (IRIS). Through this benchmark, our evaluation of leading UMLLMs uncovers systemic phenomena such as the “generation gap”, individual inconsistencies like “personality splits”, and the “counter-stereotype reward”, while offering diagnostics to guide the optimization of their fairness capabilities. With its novel and extensible framework, the IRIS benchmark can integrate ever-evolving fairness metrics, ultimately helping to resolve the “Babel Tower” impasse.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces IRIS, a benchmark for synchronous fairness evaluation of both understanding and generation in unified multimodal large language models (UMLLMs). It resides in the 'Comprehensive Fairness Benchmarks' leaf, which contains four papers including this one. This leaf sits within the broader 'Bias Detection and Measurement Methodologies' branch, indicating a moderately populated research direction focused on holistic evaluation frameworks. The paper's positioning suggests it addresses a recognized need for multi-dimensional fairness assessment, though the leaf's size indicates this is not yet a saturated area.

The taxonomy reveals several neighboring research directions. Adjacent leaves include 'Targeted Bias Measurement Approaches' with subcategories for social/demographic bias probing and stereotype measurement, representing more focused diagnostic methods. The sibling 'Evaluation Methodologies and Judge Systems' leaf addresses assessment protocols including LLM-as-judge frameworks. The paper's comprehensive approach distinguishes it from these targeted methods by aggregating 60 metrics across three dimensions (Ideal Fairness, Real-world Fidelity, Bias Inertia & Steerability), positioning it as a unifying framework rather than a single-dimension probe.

Among the 28 candidates examined across the three contributions, no clearly refuting prior work was identified: the IRIS Benchmark contribution was checked against 9 candidates, the ARES classifier and datasets against 10, and the systemic-phenomena discovery against 9, with none refutable in any case. This suggests that, within the limited search scope, the synchronous evaluation of understanding and generation, the high-dimensional fairness-space aggregation, and the specific phenomena identified (generation gap, personality splits, counter-stereotype reward) appear distinct from examined prior work, though the search was not exhaustive.

Based on the limited literature search of 28 candidates, the work appears to occupy a relatively novel position in comprehensive fairness benchmarking for UMLLMs. The synchronous evaluation approach and three-dimensional metric aggregation framework show no direct overlap with examined prior work. However, the analysis covers top-K semantic matches and does not constitute an exhaustive survey of all fairness evaluation literature, leaving open the possibility of related work outside this search scope.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 28
Refutable Papers: 0

Research Landscape Overview

Core task: Fairness evaluation in unified multimodal large language models. The field has organized itself around several complementary perspectives. Bias Detection and Measurement Methodologies encompass comprehensive benchmarks and diagnostic tools that systematically probe models for various forms of unfairness, with works like VHELM[7] and FMBench[5] establishing standardized evaluation protocols. Bias Mitigation and Debiasing Techniques focus on intervention strategies, ranging from training-time adjustments to inference-level corrections such as FairCoT[6]. Domain-Specific Fairness Applications examine bias manifestations in particular contexts like medical imaging or news understanding, while Bias Sources and Training Data Analysis investigate root causes in datasets like LAION[37]. Model-Specific Bias Phenomena explore how architectural choices and model properties influence fairness outcomes, and Cross-Modal and Audio-Language Fairness extends the conversation beyond vision-language systems. Broader Context and Survey Literature provides integrative perspectives on the evolving landscape.

A particularly active tension exists between developing holistic benchmarks that capture diverse bias dimensions versus targeted interventions for specific fairness concerns. Comprehensive evaluation suites like MultiTrust[35] and HumaniBench[22] aim to assess models across multiple axes of fairness simultaneously, enabling systematic comparison of model behaviors. Fair in Mind[0] contributes to this comprehensive benchmarking direction by providing structured fairness evaluation for unified multimodal models, positioning itself alongside works like MMDT[16] that similarly emphasize broad diagnostic coverage. In contrast, more focused efforts such as Debiasing Multimodal[1] and Fairness Unified[2] concentrate on specific mitigation pathways or theoretical frameworks.

The interplay between measurement and mitigation remains central: while benchmarks reveal where models fail, the field continues to grapple with whether fairness can be effectively addressed post-hoc or requires fundamental changes to training paradigms.

Claimed Contributions

IRIS Benchmark for synchronous fairness evaluation

The authors propose IRIS, a novel benchmark that evaluates fairness in unified multimodal large language models by synchronously assessing both generation and understanding tasks across three dimensions: Ideal Fairness, Real-world Fidelity, and Bias Inertia & Steerability. The benchmark normalizes diverse metrics into a high-dimensional fairness space to enable multi-objective trade-off analysis.
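The paper does not spell out here how heterogeneous metrics are mapped into the shared "fairness space". The following is a minimal sketch of the general idea, assuming simple min-max normalization, illustrative metric names, and unweighted within-dimension averaging; none of these choices are taken from the paper itself.

```python
import numpy as np

# Hypothetical raw metric scores for one model. The metric names below are
# illustrative placeholders, not the benchmark's actual 60 metrics.
raw = {
    "demographic_parity_gap": 0.18,   # lower is fairer
    "stereotype_assoc_score": 0.42,   # lower is fairer
    "real_world_kl_div": 0.31,        # lower means closer to real-world statistics
    "steerability_delta": 0.25,       # higher means more steerable
}

# Direction flags: +1 if a higher raw value is better, -1 if lower is better.
direction = {
    "demographic_parity_gap": -1,
    "stereotype_assoc_score": -1,
    "real_world_kl_div": -1,
    "steerability_delta": +1,
}

def normalize(value, lo, hi, sign):
    """Min-max normalize to [0, 1], flipping orientation so 1 is always 'better'."""
    scaled = (value - lo) / (hi - lo)
    return scaled if sign > 0 else 1.0 - scaled

# For this sketch, assume every metric above is already bounded in [0, 1].
keys = list(raw)
fairness_vector = np.array(
    [normalize(raw[k], 0.0, 1.0, direction[k]) for k in keys]
)

# Per-dimension aggregation: group metrics under the three IRIS axes and
# average within each group (a simple unweighted choice for illustration).
dims = {
    "ideal_fairness": ["demographic_parity_gap", "stereotype_assoc_score"],
    "real_world_fidelity": ["real_world_kl_div"],
    "bias_inertia_steerability": ["steerability_delta"],
}
scores = {
    d: float(np.mean([fairness_vector[keys.index(k)] for k in ms]))
    for d, ms in dims.items()
}
print(scores)
```

Once every metric lives on a common oriented [0, 1] scale, the per-dimension scores form coordinates in a fairness space, which is what makes the multi-objective trade-off analysis described above possible.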

9 retrieved papers
ARES classifier and four evaluation datasets

The authors develop ARES, an adaptive routing expert system for classifying demographic attributes in generated images, and construct four large-scale datasets (IRIS-Ideal-52, IRIS-Steer-60, IRIS-Gen-52, IRIS-Classifier-25) to support rigorous fairness evaluation of multimodal models.
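ARES is described only as an "adaptive routing expert system"; its internals are not given in this report. As a rough illustration of the gate-plus-experts pattern that name suggests, here is a toy sketch in which a gating function dispatches each query to a specialist classifier. The gating rule, expert set, and feature format are all hypothetical, not the authors' design.

```python
from dataclasses import dataclass
from typing import Callable, Dict

@dataclass
class Expert:
    """A specialist classifier for one demographic attribute."""
    name: str
    predict: Callable[[dict], str]  # features -> attribute label

def gate(features: dict) -> str:
    # Hypothetical gating rule: route by the attribute the query asks about.
    # A real system might instead route on learned confidence scores.
    return features["attribute"]

# Toy experts with threshold rules standing in for real classifiers.
experts: Dict[str, Expert] = {
    "gender": Expert("gender", lambda f: "female" if f["score"] > 0.5 else "male"),
    "age": Expert("age", lambda f: "young" if f["score"] > 0.5 else "old"),
}

def classify(features: dict) -> str:
    """Dispatch the query to the expert chosen by the gate."""
    return experts[gate(features)].predict(features)

print(classify({"attribute": "gender", "score": 0.7}))
```

The design point is separation of concerns: the gate decides which specialist is competent for a given input, so each expert can be trained or tuned independently for its attribute.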

10 retrieved papers
Discovery of systemic fairness phenomena in UMLLMs

Through comprehensive evaluation using the IRIS benchmark, the authors uncover novel systemic phenomena in unified multimodal models, including cross-task personality splits, a generation gap where models underperform in generation compared to understanding, and the counter-stereotype reward effect.

9 retrieved papers

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

IRIS Benchmark for synchronous fairness evaluation

The authors propose IRIS, a novel benchmark that evaluates fairness in unified multimodal large language models by synchronously assessing both generation and understanding tasks across three dimensions: Ideal Fairness, Real-world Fidelity, and Bias Inertia & Steerability. The benchmark normalizes diverse metrics into a high-dimensional fairness space to enable multi-objective trade-off analysis.

Contribution

ARES classifier and four evaluation datasets

The authors develop ARES, an adaptive routing expert system for classifying demographic attributes in generated images, and construct four large-scale datasets (IRIS-Ideal-52, IRIS-Steer-60, IRIS-Gen-52, IRIS-Classifier-25) to support rigorous fairness evaluation of multimodal models.

Contribution

Discovery of systemic fairness phenomena in UMLLMs

Through comprehensive evaluation using the IRIS benchmark, the authors uncover novel systemic phenomena in unified multimodal models, including cross-task personality splits, a generation gap where models underperform in generation compared to understanding, and the counter-stereotype reward effect.