Fair in Mind, Fair in Action? A Synchronous Benchmark for Understanding and Generation in UMLLMs

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: AI Fairness, Unified Multimodal Large Language Models (UMLLMs), Fairness Benchmark, Cross-Task Evaluation, Bias Measurement
Abstract:

As artificial intelligence (AI) permeates society, ensuring fairness has become a foundational challenge. However, the field faces a “Babel Tower” dilemma: fairness metrics abound, yet their underlying philosophical assumptions often conflict, hindering unified paradigms—particularly in unified multimodal large language models (UMLLMs), where biases propagate systemically across tasks. To address this, we introduce the IRIS Benchmark, to our knowledge the first benchmark designed to synchronously evaluate the fairness of both understanding and generation in UMLLMs. Enabled by our high-fidelity demographic classifier, ARES, and four supporting large-scale datasets, the benchmark normalizes and aggregates arbitrary metrics into a high-dimensional “fairness space”, integrating 60 granular metrics across three dimensions—Ideal Fairness, Real-world Fidelity, and Bias Inertia & Steerability (IRIS). Through this benchmark, our evaluation of leading UMLLMs uncovers systemic phenomena such as the “generation gap”, individual inconsistencies like “personality splits”, and the “counter-stereotype reward”, while offering diagnostics to guide the optimization of their fairness capabilities. With its novel and extensible framework, the IRIS benchmark can integrate ever-evolving fairness metrics, ultimately helping to resolve the “Babel Tower” impasse.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces IRIS, a benchmark for synchronous fairness evaluation of both understanding and generation in unified multimodal large language models (UMLLMs). It resides in the 'Comprehensive Fairness Benchmarks' leaf, which contains four papers including this one. This leaf sits within the broader 'Bias Detection and Measurement Methodologies' branch, indicating a moderately populated research direction focused on holistic evaluation frameworks. The paper's positioning suggests it addresses a recognized need for multi-dimensional fairness assessment, though the leaf's size indicates this is not yet a saturated area.

The taxonomy reveals several neighboring research directions. Adjacent leaves include 'Targeted Bias Measurement Approaches' with subcategories for social/demographic bias probing and stereotype measurement, representing more focused diagnostic methods. The sibling 'Evaluation Methodologies and Judge Systems' leaf addresses assessment protocols including LLM-as-judge frameworks. The paper's comprehensive approach distinguishes it from these targeted methods by aggregating 60 metrics across three dimensions (Ideal Fairness, Real-world Fidelity, Bias Inertia & Steerability), positioning it as a unifying framework rather than a single-dimension probe.

Among the 28 candidates examined across the three contributions, no clearly refuting prior work was identified: the IRIS Benchmark contribution was checked against 9 candidates, the ARES classifier and datasets against 10, and the systemic-phenomena discovery against 9, with none refutable in any case. This suggests that, within the limited search scope, the synchronous evaluation of understanding and generation, the high-dimensional fairness-space aggregation, and the specific phenomena identified (generation gap, personality splits, counter-stereotype reward) appear distinct from examined prior work, though the search was not exhaustive.

Based on the limited literature search of 28 candidates, the work appears to occupy a relatively novel position in comprehensive fairness benchmarking for UMLLMs. The synchronous evaluation approach and three-dimensional metric aggregation framework show no direct overlap with examined prior work. However, the analysis covers top-K semantic matches and does not constitute an exhaustive survey of all fairness evaluation literature, leaving open the possibility of related work outside this search scope.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 28
Refutable Papers: 0

Research Landscape Overview

Core task: Fairness evaluation in unified multimodal large language models. The field has organized itself around several complementary perspectives. Bias Detection and Measurement Methodologies encompass comprehensive benchmarks and diagnostic tools that systematically probe models for various forms of unfairness, with works like VHELM[7] and FMBench[5] establishing standardized evaluation protocols. Bias Mitigation and Debiasing Techniques focus on intervention strategies, ranging from training-time adjustments to inference-level corrections such as FairCoT[6]. Domain-Specific Fairness Applications examine bias manifestations in particular contexts like medical imaging or news understanding, while Bias Sources and Training Data Analysis investigate root causes in datasets like LAION[37]. Model-Specific Bias Phenomena explore how architectural choices and model properties influence fairness outcomes, and Cross-Modal and Audio-Language Fairness extends the conversation beyond vision-language systems. Broader Context and Survey Literature provides integrative perspectives on the evolving landscape.

A particularly active tension exists between developing holistic benchmarks that capture diverse bias dimensions versus targeted interventions for specific fairness concerns. Comprehensive evaluation suites like MultiTrust[35] and HumaniBench[22] aim to assess models across multiple axes of fairness simultaneously, enabling systematic comparison of model behaviors. Fair in Mind[0] contributes to this comprehensive benchmarking direction by providing structured fairness evaluation for unified multimodal models, positioning itself alongside works like MMDT[16] that similarly emphasize broad diagnostic coverage. In contrast, more focused efforts such as Debiasing Multimodal[1] and Fairness Unified[2] concentrate on specific mitigation pathways or theoretical frameworks.

The interplay between measurement and mitigation remains central: while benchmarks reveal where models fail, the field continues to grapple with whether fairness can be effectively addressed post-hoc or requires fundamental changes to training paradigms.

Claimed Contributions

IRIS Benchmark for synchronous fairness evaluation

The authors propose IRIS, a novel benchmark that evaluates fairness in unified multimodal large language models by synchronously assessing both generation and understanding tasks across three dimensions: Ideal Fairness, Real-world Fidelity, and Bias Inertia & Steerability. The benchmark normalizes diverse metrics into a high-dimensional fairness space to enable multi-objective trade-off analysis.
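The paper does not spell out here how heterogeneous metrics are mapped into the shared "fairness space". The following is a minimal sketch of the general idea, assuming simple min-max normalization, illustrative metric names, and unweighted within-dimension averaging; none of these choices are taken from the paper itself.

```python
import numpy as np

# Hypothetical raw metric scores for one model. The metric names below are
# illustrative placeholders, not the benchmark's actual 60 metrics.
raw = {
    "demographic_parity_gap": 0.18,   # lower is fairer
    "stereotype_assoc_score": 0.42,   # lower is fairer
    "real_world_kl_div": 0.31,        # lower means closer to real-world statistics
    "steerability_delta": 0.25,       # higher means more steerable
}

# Direction flags: +1 if a higher raw value is better, -1 if lower is better.
direction = {
    "demographic_parity_gap": -1,
    "stereotype_assoc_score": -1,
    "real_world_kl_div": -1,
    "steerability_delta": +1,
}

def normalize(value, lo, hi, sign):
    """Min-max normalize to [0, 1], flipping orientation so 1 is always 'better'."""
    scaled = (value - lo) / (hi - lo)
    return scaled if sign > 0 else 1.0 - scaled

# For this sketch, assume every metric above is already bounded in [0, 1].
keys = list(raw)
fairness_vector = np.array(
    [normalize(raw[k], 0.0, 1.0, direction[k]) for k in keys]
)

# Per-dimension aggregation: group metrics under the three IRIS axes and
# average within each group (a simple unweighted choice for illustration).
dims = {
    "ideal_fairness": ["demographic_parity_gap", "stereotype_assoc_score"],
    "real_world_fidelity": ["real_world_kl_div"],
    "bias_inertia_steerability": ["steerability_delta"],
}
scores = {
    d: float(np.mean([fairness_vector[keys.index(k)] for k in ms]))
    for d, ms in dims.items()
}
print(scores)
```

Once every metric lives on a common oriented [0, 1] scale, the per-dimension scores form coordinates in a fairness space, which is what makes the multi-objective trade-off analysis described above possible.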

9 retrieved papers
ARES classifier and four evaluation datasets

The authors develop ARES, an adaptive routing expert system for classifying demographic attributes in generated images, and construct four large-scale datasets (IRIS-Ideal-52, IRIS-Steer-60, IRIS-Gen-52, IRIS-Classifier-25) to support rigorous fairness evaluation of multimodal models.
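ARES is described only as an "adaptive routing expert system"; its internals are not given in this report. As a rough illustration of the gate-plus-experts pattern that name suggests, here is a toy sketch in which a gating function dispatches each query to a specialist classifier. The gating rule, expert set, and feature format are all hypothetical, not the authors' design.

```python
from dataclasses import dataclass
from typing import Callable, Dict

@dataclass
class Expert:
    """A specialist classifier for one demographic attribute."""
    name: str
    predict: Callable[[dict], str]  # features -> attribute label

def gate(features: dict) -> str:
    # Hypothetical gating rule: route by the attribute the query asks about.
    # A real system might instead route on learned confidence scores.
    return features["attribute"]

# Toy experts with threshold rules standing in for real classifiers.
experts: Dict[str, Expert] = {
    "gender": Expert("gender", lambda f: "female" if f["score"] > 0.5 else "male"),
    "age": Expert("age", lambda f: "young" if f["score"] > 0.5 else "old"),
}

def classify(features: dict) -> str:
    """Dispatch the query to the expert chosen by the gate."""
    return experts[gate(features)].predict(features)

print(classify({"attribute": "gender", "score": 0.7}))
```

The design point is separation of concerns: the gate decides which specialist is competent for a given input, so each expert can be trained or tuned independently for its attribute.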

10 retrieved papers
Discovery of systemic fairness phenomena in UMLLMs

Through comprehensive evaluation using the IRIS benchmark, the authors uncover novel systemic phenomena in unified multimodal models, including cross-task personality splits, a generation gap where models underperform in generation compared to understanding, and the counter-stereotype reward effect.

9 retrieved papers

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

IRIS Benchmark for synchronous fairness evaluation

The authors propose IRIS, a novel benchmark that evaluates fairness in unified multimodal large language models by synchronously assessing both generation and understanding tasks across three dimensions: Ideal Fairness, Real-world Fidelity, and Bias Inertia & Steerability. The benchmark normalizes diverse metrics into a high-dimensional fairness space to enable multi-objective trade-off analysis.

Contribution

ARES classifier and four evaluation datasets

The authors develop ARES, an adaptive routing expert system for classifying demographic attributes in generated images, and construct four large-scale datasets (IRIS-Ideal-52, IRIS-Steer-60, IRIS-Gen-52, IRIS-Classifier-25) to support rigorous fairness evaluation of multimodal models.

Contribution

Discovery of systemic fairness phenomena in UMLLMs

Through comprehensive evaluation using the IRIS benchmark, the authors uncover novel systemic phenomena in unified multimodal models, including cross-task personality splits, a generation gap where models underperform in generation compared to understanding, and the counter-stereotype reward effect.