Guardrail-Agnostic Societal Bias Evaluation in Large Vision-Language Models

ICLR 2026 Conference SubmissionAnonymous Authors

OpenReview Score: 6.0 Download Report PDF

Social biasLVLMsBias evaluation

We propose a societal bias evaluation method for large vision-language models (LVLMs) in the era of strong safety guardrails. Existing benchmarks rely on prompts that ask models to infer attributes of people in images (e.g., "Is this person a CEO or a secretary?"). However, we find that LVLMs with strong guardrails, such as GPT and Claude, often refuse these prompts, making evaluations unreliable. To address this, we change the prior evaluation paradigm by decoupling the task from the depicted person: instead of inferring person's attributes, we use prompts that do not ask about the person (e.g., "Write a fictional story about an imaginary person.") and attach the image as provisional user information to implicitly provide demographic cues, then compare outputs across user demographics. Instantiated across three tasks — story generation, term explanation, and exam-style QA — our method avoids refusals even in guardrailed LVLMs, enabling reliable bias measurement. Applying it to 20 recent LVLMs, both open-source and proprietary, we find that all models undesirably use user demographic information in person-irrelevant tasks; for instance, characters in stories are often portrayed as mechanic for male users and nurse for female users. Although still biased, proprietary models like GPT-5 show lower bias than open-source ones. We analyze potential factors behind this gap, discussing continuous model monitoring and improvement as a possible driving factor for reducing bias.

Abstract:

Disclaimer

This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.

NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.

If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes a guardrail-agnostic evaluation method for societal bias in large vision-language models, addressing the challenge that safety-aligned models often refuse direct attribute-inference prompts. It resides in the 'Large Vision-Language Models and Assistants' leaf, which contains five papers examining bias in generative LVLMs and instruction-tuned assistants. This leaf sits within a broader taxonomy of 50 papers across bias measurement, mitigation, and analysis, indicating a moderately populated research direction focused on modern conversational VLMs rather than contrastive models like CLIP.

The taxonomy reveals neighboring leaves examining contrastive VLMs (five papers on CLIP-family models) and pretrained VLM families (three comparative studies). The paper's focus on guardrail interactions distinguishes it from sibling works like Gender Biases VLAs, which documents bias prevalence without addressing safety mechanisms, and Counterfactuals at Scale, which uses data augmentation for robustness testing. The 'Bias Measurement Frameworks' branch contains comprehensive benchmarks (six papers) and specialized tools (six papers), but none explicitly tackle the refusal problem in safety-aligned models, suggesting the paper addresses a gap at the intersection of bias evaluation and model alignment.

Among 27 candidates examined through limited semantic search, none clearly refute the three core contributions. The guardrail-agnostic method examined 10 candidates with zero refutations, the three-task instantiation examined 10 with zero refutations, and the paradigm shift from attribute-inferring to person-irrelevant prompts examined 7 with zero refutations. This suggests that within the search scope—focused on recent LVLM bias literature—the specific approach of decoupling demographic cues from task prompts to circumvent safety refusals has not been documented. However, the limited search scale means unexplored work in adjacent areas (prompt engineering, implicit bias probing) may exist.

The analysis indicates novelty within the examined literature, particularly in addressing how safety guardrails complicate bias measurement. The taxonomy shows the field has developed rich benchmarks and mitigation techniques, but the specific methodological challenge of evaluating guardrailed models appears underexplored among the 50 papers surveyed. The contribution's distinctiveness depends partly on whether the 'person-irrelevant prompt' strategy represents a fundamental paradigm shift or an incremental adaptation of existing implicit bias probing methods, which the limited search scope cannot definitively resolve.

Taxonomy

Core-task Taxonomy Papers

Claimed Contributions

Contribution Candidate Papers Compared

Refutable Paper

Research Landscape Overview

Core task: societal bias evaluation in large vision-language models. The field has organized itself around several complementary branches that together address how biases emerge, persist, and might be measured or reduced in VLMs. Bias Measurement Frameworks and Benchmarks establish standardized evaluation protocols and datasets—ranging from holistic suites like Vhelm Holistic Evaluation[1] to specialized resources such as Vlbiasbench[4] and VIGNETTE[6]—that enable systematic comparison across models. Bias Analysis in Specific VLM Architectures examines how particular model families (large assistants, retrieval-augmented systems, or embodied agents) exhibit distinct bias patterns, while Bias Sources and Contributing Factors investigates the origins of bias in training data, architectural choices, and scaling dynamics. Bias Mitigation Techniques explores interventions such as prompt engineering, debiasing algorithms, and fairness-aware training, and Domain-Specific and Contextual Bias Studies zoom into particular application areas (news captioning, medical imaging, cultural contexts) where biases have unique manifestations. Finally, Methodological Advances and Meta-Analysis refines evaluation metrics and synthesizes findings across studies to improve the rigor of bias research. Recent work has intensified around two contrasting themes: developing richer benchmarks that capture intersectional and implicit biases (e.g., Implicit Social Biases[5], Social Perception Faces[2]) versus probing how guardrails and safety mechanisms themselves interact with bias (Guardrail Agnostic Bias[0]). The original paper, Guardrail Agnostic Bias[0], sits within the branch analyzing large vision-language models and assistants, where it addresses a relatively underexplored question: whether safety interventions inadvertently modulate or mask underlying biases. This contrasts with nearby studies like Gender Biases VLAs[44], which document bias prevalence in vision-language agents, and Counterfactuals at Scale[46], which uses large-scale counterfactual generation to measure bias robustness. By focusing on guardrail effects, Guardrail Agnostic Bias[0] highlights an emerging concern that evaluation must account for post-hoc safety layers, not just base model behavior, to obtain a complete picture of societal bias in deployed systems.

Claimed Contributions

Guardrail-agnostic societal bias evaluation method for LVLMs

10 retrieved papers

The authors introduce a new evaluation framework that decouples the task from the depicted person by using person-irrelevant prompts and treating images as provisional user information rather than the subject of inference. This design avoids model refusals triggered by safety guardrails, enabling reliable bias measurement even in strongly guarded models.

10 retrieved papers

Three-task instantiation of the evaluation framework

10 retrieved papers

The authors instantiate their evaluation protocol across three person-irrelevant tasks: story generation, term explanation, and exam-style QA. Each task is designed to probe different aspects of societal bias while maintaining zero refusal rates across all tested models.

10 retrieved papers

Paradigm shift from attribute-inferring to person-irrelevant prompts

7 retrieved papers

The authors propose a fundamental change in how bias is evaluated by replacing attribute-inferring prompts with person-irrelevant ones and changing the role of images from target to context. This paradigm shift addresses the refusal problem in existing benchmarks while reducing the impact of spurious image contexts.

7 retrieved papers

Core Task Comparisons

Comparisons with papers in the same taxonomy category

[12] Uncovering bias in large vision-language models with counterfactuals PDF

Howard, Phillip, Bhiwandiwalla, Anahita, Phillip Howard, Fraser, Kathleen C., Anahita Bhiwandiwalla, Kiritchenko, Svetlana, Kathleen C. Fraser, S. Kiritchenko (2024)

[17] A Closer Look at Bias and Chain-of-Thought Faithfulness of Large (Vision) Language Models PDF

Sriram Balasubramanian, Samyadeep Basu, Soheil Feizi (2025)

[44] Revealing and Reducing Gender Biases in Vision and Language Assistants (VLAs) PDF

Girrbach, Leander, Alaniz, Stephan, Leander Girrbach, Huang Yiran, Yiran Huang, Darrell, Trevor, Stephan Alaniz, Akata, Zeynep, Trevor Darrell, Zeynep Akata (2024) • International Conference on Learning Representations

[46] Uncovering bias in large vision-language models at scale with counterfactuals PDF

Bhiwandiwalla, Anahita, Fraser Kathleen, Howard, Phillip R., Kiritchenko, Svetlana (2025)

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Guardrail-agnostic societal bias evaluation method for LVLMs

[9] Investigating Stereotypical Bias in Large Language and Vision-Language Models PDF

Cannot Refute

[10] Data matters most: Auditing social bias in contrastive visionâlanguage models PDF

Cannot Refute

[23] Social debiasing for fair multi-modal llms PDF

Cannot Refute

[24] Socialcounterfactuals: Probing and mitigating intersectional social biases in vision-language models with counterfactual examples PDF

Cannot Refute

[32] Debiasing vision-language models via biased prompts PDF

Cannot Refute

[33] A Prompt Array Keeps the Bias Away: Debiasing Vision-Language Models with Adversarial Learning PDF

Cannot Refute

[51] Words or Vision: Do Vision-Language Models Have Blind Faith in Text? PDF

Cannot Refute

[52] Counterfactually Measuring and Eliminating Social Bias in Vision-Language Pre-training Models PDF

Cannot Refute

[53] CLIP the Bias: How Useful is Balancing Data in Multimodal Learning? PDF

Cannot Refute

[54] Red-Teaming for Inducing Societal Bias in Large Language Models PDF

Cannot Refute

Contribution

Three-task instantiation of the evaluation framework

[62] Trustllm: Trustworthiness in large language models PDF

Cannot Refute

[63] Multi-objective linguistic control of large language models PDF

Cannot Refute

[64] Red teaming language models to reduce harms: Methods, scaling behaviors, and lessons learned PDF

Cannot Refute

[65] A Methodological Framework for Auditing Norm-Sensitive Behaviour in Large Language Models: Research Design for Employment Contexts PDF

Cannot Refute

[66] Understanding Large Language Model Vulnerabilities to Social Bias Attacks PDF

Cannot Refute

[67] Silenced Biases: The Dark Side LLMs Learned to Refuse PDF

Cannot Refute

[68] Benchmarking Adversarial Robustness to Bias Elicitation in Large Language Models: Scalable Automated Assessment with LLM-as-a-Judge PDF

Cannot Refute

[69] Bias Beyond Demographics: Probing Decision Boundaries in Black-Box LVLMs via Counterfactual VQA PDF

Cannot Refute

[70] From Narrow Unlearning to Emergent Misalignment: Causes, Consequences, and Containment in LLMs PDF

Cannot Refute

[71] One Model for All: Multi-Objective Controllable Language Models PDF

Cannot Refute

Contribution

Paradigm shift from attribute-inferring to person-irrelevant prompts

[55] Large language model as attributed training data generator: A tale of diversity and bias PDF

Cannot Refute

[56] Can multimodal large language models enhance performance benefits among higher education students? An investigation based on the taskâtechnology fit â¦ PDF

Cannot Refute

[57] Investors' acceptance and use of investment-based crowdfunding platforms: an integrated perspective PDF

Cannot Refute

[58] Measuring Mechanistic Independence: Can Bias Be Removed Without Erasing Demographics? PDF

Cannot Refute

[59] Assessing the Reliability of LLMs Annotations in the Context of Demographic Bias and Model Explanation PDF

Cannot Refute

[60] Towards Alleviating the Object Bias in Prompt Tuning-based Factual Knowledge Extraction PDF

Cannot Refute

[61] Selective Retrieval of Stimulus Information versus Thematic Judgments in Natural Language Inferences. PDF

Cannot Refute

Guardrail-Agnostic Societal Bias Evaluation in Large Vision-Language Models

Overview

Overall Novelty Assessment

Taxonomy

Research Landscape Overview

Claimed Contributions

Core Task Comparisons

[12] Uncovering bias in large vision-language models with counterfactuals PDF

[17] A Closer Look at Bias and Chain-of-Thought Faithfulness of Large (Vision) Language Models PDF

[44] Revealing and Reducing Gender Biases in Vision and Language Assistants (VLAs) PDF

[46] Uncovering bias in large vision-language models at scale with counterfactuals PDF

Contribution Analysis

Guardrail-agnostic societal bias evaluation method for LVLMs

[9] Investigating Stereotypical Bias in Large Language and Vision-Language Models PDF

[10] Data matters most: Auditing social bias in contrastive visionâlanguage models PDF

[23] Social debiasing for fair multi-modal llms PDF

[24] Socialcounterfactuals: Probing and mitigating intersectional social biases in vision-language models with counterfactual examples PDF

[32] Debiasing vision-language models via biased prompts PDF

[33] A Prompt Array Keeps the Bias Away: Debiasing Vision-Language Models with Adversarial Learning PDF

[51] Words or Vision: Do Vision-Language Models Have Blind Faith in Text? PDF

[52] Counterfactually Measuring and Eliminating Social Bias in Vision-Language Pre-training Models PDF

[53] CLIP the Bias: How Useful is Balancing Data in Multimodal Learning? PDF

[54] Red-Teaming for Inducing Societal Bias in Large Language Models PDF

Three-task instantiation of the evaluation framework

[62] Trustllm: Trustworthiness in large language models PDF

[63] Multi-objective linguistic control of large language models PDF

[64] Red teaming language models to reduce harms: Methods, scaling behaviors, and lessons learned PDF

[65] A Methodological Framework for Auditing Norm-Sensitive Behaviour in Large Language Models: Research Design for Employment Contexts PDF

[66] Understanding Large Language Model Vulnerabilities to Social Bias Attacks PDF

[67] Silenced Biases: The Dark Side LLMs Learned to Refuse PDF

[68] Benchmarking Adversarial Robustness to Bias Elicitation in Large Language Models: Scalable Automated Assessment with LLM-as-a-Judge PDF

[69] Bias Beyond Demographics: Probing Decision Boundaries in Black-Box LVLMs via Counterfactual VQA PDF

[70] From Narrow Unlearning to Emergent Misalignment: Causes, Consequences, and Containment in LLMs PDF

[71] One Model for All: Multi-Objective Controllable Language Models PDF

Paradigm shift from attribute-inferring to person-irrelevant prompts

[55] Large language model as attributed training data generator: A tale of diversity and bias PDF

[56] Can multimodal large language models enhance performance benefits among higher education students? An investigation based on the taskâtechnology fit â¦ PDF

[57] Investors' acceptance and use of investment-based crowdfunding platforms: an integrated perspective PDF

[58] Measuring Mechanistic Independence: Can Bias Be Removed Without Erasing Demographics? PDF

[59] Assessing the Reliability of LLMs Annotations in the Context of Demographic Bias and Model Explanation PDF

[60] Towards Alleviating the Object Bias in Prompt Tuning-based Factual Knowledge Extraction PDF

[61] Selective Retrieval of Stimulus Information versus Thematic Judgments in Natural Language Inferences. PDF

Table of Contents

[10] Data matters most: Auditing social bias in contrastive visionâlanguage models PDF

[56] Can multimodal large language models enhance performance benefits among higher education students? An investigation based on the taskâtechnology fit â¦ PDF