Guardrail-Agnostic Societal Bias Evaluation in Large Vision-Language Models

ICLR 2026 Conference SubmissionAnonymous Authors
Social biasLVLMsBias evaluation
Abstract:

We propose a societal bias evaluation method for large vision-language models (LVLMs) in the era of strong safety guardrails. Existing benchmarks rely on prompts that ask models to infer attributes of people in images (e.g., "Is this person a CEO or a secretary?"). However, we find that LVLMs with strong guardrails, such as GPT and Claude, often refuse these prompts, making evaluations unreliable. To address this, we change the prior evaluation paradigm by decoupling the task from the depicted person: instead of inferring person's attributes, we use prompts that do not ask about the person (e.g., "Write a fictional story about an imaginary person.") and attach the image as provisional user information to implicitly provide demographic cues, then compare outputs across user demographics. Instantiated across three tasks — story generation, term explanation, and exam-style QA — our method avoids refusals even in guardrailed LVLMs, enabling reliable bias measurement. Applying it to 20 recent LVLMs, both open-source and proprietary, we find that all models undesirably use user demographic information in person-irrelevant tasks; for instance, characters in stories are often portrayed as mechanic for male users and nurse for female users. Although still biased, proprietary models like GPT-5 show lower bias than open-source ones. We analyze potential factors behind this gap, discussing continuous model monitoring and improvement as a possible driving factor for reducing bias.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes a guardrail-agnostic evaluation method for societal bias in large vision-language models, addressing the challenge that safety-aligned models often refuse direct attribute-inference prompts. It resides in the 'Large Vision-Language Models and Assistants' leaf, which contains five papers examining bias in generative LVLMs and instruction-tuned assistants. This leaf sits within a broader taxonomy of 50 papers across bias measurement, mitigation, and analysis, indicating a moderately populated research direction focused on modern conversational VLMs rather than contrastive models like CLIP.

The taxonomy reveals neighboring leaves examining contrastive VLMs (five papers on CLIP-family models) and pretrained VLM families (three comparative studies). The paper's focus on guardrail interactions distinguishes it from sibling works like Gender Biases VLAs, which documents bias prevalence without addressing safety mechanisms, and Counterfactuals at Scale, which uses data augmentation for robustness testing. The 'Bias Measurement Frameworks' branch contains comprehensive benchmarks (six papers) and specialized tools (six papers), but none explicitly tackle the refusal problem in safety-aligned models, suggesting the paper addresses a gap at the intersection of bias evaluation and model alignment.

Among 27 candidates examined through limited semantic search, none clearly refute the three core contributions. The guardrail-agnostic method examined 10 candidates with zero refutations, the three-task instantiation examined 10 with zero refutations, and the paradigm shift from attribute-inferring to person-irrelevant prompts examined 7 with zero refutations. This suggests that within the search scope—focused on recent LVLM bias literature—the specific approach of decoupling demographic cues from task prompts to circumvent safety refusals has not been documented. However, the limited search scale means unexplored work in adjacent areas (prompt engineering, implicit bias probing) may exist.

The analysis indicates novelty within the examined literature, particularly in addressing how safety guardrails complicate bias measurement. The taxonomy shows the field has developed rich benchmarks and mitigation techniques, but the specific methodological challenge of evaluating guardrailed models appears underexplored among the 50 papers surveyed. The contribution's distinctiveness depends partly on whether the 'person-irrelevant prompt' strategy represents a fundamental paradigm shift or an incremental adaptation of existing implicit bias probing methods, which the limited search scope cannot definitively resolve.

Taxonomy

Core-task Taxonomy Papers
50
3
Claimed Contributions
27
Contribution Candidate Papers Compared
0
Refutable Paper

Research Landscape Overview

Core task: societal bias evaluation in large vision-language models. The field has organized itself around several complementary branches that together address how biases emerge, persist, and might be measured or reduced in VLMs. Bias Measurement Frameworks and Benchmarks establish standardized evaluation protocols and datasets—ranging from holistic suites like Vhelm Holistic Evaluation[1] to specialized resources such as Vlbiasbench[4] and VIGNETTE[6]—that enable systematic comparison across models. Bias Analysis in Specific VLM Architectures examines how particular model families (large assistants, retrieval-augmented systems, or embodied agents) exhibit distinct bias patterns, while Bias Sources and Contributing Factors investigates the origins of bias in training data, architectural choices, and scaling dynamics. Bias Mitigation Techniques explores interventions such as prompt engineering, debiasing algorithms, and fairness-aware training, and Domain-Specific and Contextual Bias Studies zoom into particular application areas (news captioning, medical imaging, cultural contexts) where biases have unique manifestations. Finally, Methodological Advances and Meta-Analysis refines evaluation metrics and synthesizes findings across studies to improve the rigor of bias research. Recent work has intensified around two contrasting themes: developing richer benchmarks that capture intersectional and implicit biases (e.g., Implicit Social Biases[5], Social Perception Faces[2]) versus probing how guardrails and safety mechanisms themselves interact with bias (Guardrail Agnostic Bias[0]). The original paper, Guardrail Agnostic Bias[0], sits within the branch analyzing large vision-language models and assistants, where it addresses a relatively underexplored question: whether safety interventions inadvertently modulate or mask underlying biases. This contrasts with nearby studies like Gender Biases VLAs[44], which document bias prevalence in vision-language agents, and Counterfactuals at Scale[46], which uses large-scale counterfactual generation to measure bias robustness. By focusing on guardrail effects, Guardrail Agnostic Bias[0] highlights an emerging concern that evaluation must account for post-hoc safety layers, not just base model behavior, to obtain a complete picture of societal bias in deployed systems.

Claimed Contributions

Guardrail-agnostic societal bias evaluation method for LVLMs

The authors introduce a new evaluation framework that decouples the task from the depicted person by using person-irrelevant prompts and treating images as provisional user information rather than the subject of inference. This design avoids model refusals triggered by safety guardrails, enabling reliable bias measurement even in strongly guarded models.

10 retrieved papers
Three-task instantiation of the evaluation framework

The authors instantiate their evaluation protocol across three person-irrelevant tasks: story generation, term explanation, and exam-style QA. Each task is designed to probe different aspects of societal bias while maintaining zero refusal rates across all tested models.

10 retrieved papers
Paradigm shift from attribute-inferring to person-irrelevant prompts

The authors propose a fundamental change in how bias is evaluated by replacing attribute-inferring prompts with person-irrelevant ones and changing the role of images from target to context. This paradigm shift addresses the refusal problem in existing benchmarks while reducing the impact of spurious image contexts.

7 retrieved papers

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Guardrail-agnostic societal bias evaluation method for LVLMs

The authors introduce a new evaluation framework that decouples the task from the depicted person by using person-irrelevant prompts and treating images as provisional user information rather than the subject of inference. This design avoids model refusals triggered by safety guardrails, enabling reliable bias measurement even in strongly guarded models.

Contribution

Three-task instantiation of the evaluation framework

The authors instantiate their evaluation protocol across three person-irrelevant tasks: story generation, term explanation, and exam-style QA. Each task is designed to probe different aspects of societal bias while maintaining zero refusal rates across all tested models.

Contribution

Paradigm shift from attribute-inferring to person-irrelevant prompts

The authors propose a fundamental change in how bias is evaluated by replacing attribute-inferring prompts with person-irrelevant ones and changing the role of images from target to context. This paradigm shift addresses the refusal problem in existing benchmarks while reducing the impact of spurious image contexts.

Guardrail-Agnostic Societal Bias Evaluation in Large Vision-Language Models | Novelty Validation