Can SAEs reveal and mitigate racial biases of LLMs in healthcare?

ICLR 2026 Conference Submission, Anonymous Authors
Keywords: clinical natural language processing, mechanistic interpretability
Abstract:

LLMs are increasingly being used in healthcare. This promises to free physicians from drudgery, enabling better care to be delivered at scale. But the use of LLMs in this space also brings risks; for example, such models may worsen existing biases. How can we spot when LLMs are (spuriously) relying on patient race to inform predictions? In this work we assess the degree to which Sparse Autoencoders (SAEs) can reveal (and control) associations the model has made between race and stigmatizing concepts.

We first identify an SAE latent in gemma-2 models that appears to correlate with Black individuals. We find that this latent activates on reasonable input sequences (e.g., "African American") but also on problematic words like "incarceration". We then show that we can use this latent to "steer" models to generate outputs about Black patients, and that doing so induces problematic associations in model outputs. For example, activating the "Black" latent increases the predicted probability that a patient will become "belligerent". We also find that even in this controlled setting, in which we causally intervene to manipulate only patient race, elicited CoT reasoning strings do not communicate that race is a factor in the resulting assessments. Finally, we evaluate the degree to which such "steering" via latents might be useful for mitigating bias. We find that it offers improvements in simple settings, but is less successful for more realistic and complex clinical tasks.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholarly search engine). It analyzes a paper's claimed tasks and contributions against retrieved prior work. While this system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper applies Sparse Autoencoders to identify and manipulate race-associated latents in Gemma-2 models, demonstrating how these features correlate with stigmatizing concepts in clinical contexts. It resides in the Healthcare-Specific Debiasing and Fairness Interventions leaf, which contains only three papers total. This represents a relatively sparse research direction within the broader taxonomy of 50 papers across bias detection, mitigation, clinical applications, and foundational frameworks. The small sibling set suggests this mechanistic interpretability approach to healthcare bias remains underexplored compared to detection-focused or domain-specific bias studies.

The taxonomy reveals substantial activity in adjacent areas: Bias Detection and Measurement contains 23 papers across benchmarking, clinical task characterization, and EHR documentation analysis, while Clinical Applications houses 11 papers examining domain-specific manifestations. The original paper bridges these branches by using interpretability tools (SAEs) not merely to detect bias but to causally intervene and assess mitigation potential. Neighboring work like general debiasing techniques focuses on prompt engineering or data augmentation, whereas this paper targets internal model representations. The scope note for this leaf emphasizes healthcare-specific adaptations, distinguishing it from both generic fairness methods and clinical prediction model adjustments.

Among 16 candidates examined, contribution-level analysis shows mixed novelty signals. The SAE application to clinical notes (Contribution 1) examined one candidate with no clear refutation. The causal steering and chain-of-thought unfaithfulness analysis (Contribution 2) examined five candidates, with one appearing to provide overlapping prior work on model steering or reasoning transparency. The SAE-based bias detection and mitigation assessment (Contribution 3) examined ten candidates without clear refutation. These statistics reflect a limited search scope—top-K semantic matches plus citation expansion—not an exhaustive literature review. The steering contribution shows more substantial prior work overlap within this constrained sample.

Given the limited search scale and sparse taxonomy leaf, the work appears to occupy a relatively novel position in applying mechanistic interpretability to healthcare bias mitigation. The SAE-based approach differentiates it from sibling papers, though the steering methodology shows some overlap with existing interpretability research. The analysis covers semantic neighbors and direct citations but cannot rule out relevant work outside the top-16 candidates examined. The sparse leaf population suggests either emerging interest or underrepresentation of interpretability-driven debiasing in current healthcare NLP literature.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 16
Refutable Paper: 1

Research Landscape Overview

Core task: revealing and mitigating racial biases in healthcare language models. The field has organized itself around four main branches that reflect both the diagnostic and therapeutic sides of bias research. Bias Detection and Measurement in Healthcare LLMs encompasses foundational work on identifying disparities in model outputs, often through systematic audits of demographic performance gaps (e.g., Demographic Disparities Medical LLMs[2], Sociodemographic Biases Medical LLMs[5]). Clinical Applications and Domain-Specific Bias Studies examines how biases manifest in particular medical contexts—from emergency triage (Missingness Racial Disparities Triage[12]) to psychiatric assessments (Racial Bias Psychiatric LLMs[18])—revealing that bias patterns vary substantially across specialties. Foundational Bias Research and Methodological Frameworks provides the conceptual scaffolding, drawing on broader fairness literature (Social Biases Language Models[1]) and establishing evaluation protocols. Finally, Bias Mitigation and Fairness Enhancement Strategies focuses on interventions, ranging from data augmentation and prompt engineering to architectural modifications that aim to reduce disparate outcomes.

Within the mitigation branch, a particularly active line of work explores healthcare-specific debiasing techniques that go beyond generic fairness methods. SAEs Racial Bias Healthcare[0] exemplifies this direction by using sparse autoencoders to interpret and intervene on racial bias representations within clinical language models, offering a mechanistic lens on how biases encode themselves in model internals. This approach contrasts with neighboring studies like Healthcare Racial LGBTQ Biases[8], which documents intersectional disparities across multiple demographic axes, and Linguistic Bias BERT Medical[25], which examines how pre-trained embeddings inherit biased associations from clinical corpora.
A central tension across these works is whether mitigation should target model architectures, training data composition, or deployment-time guardrails—each strategy trading off between interpretability, scalability, and the risk of obscuring rather than eliminating bias. The original paper's focus on interpretable debiasing through sparse feature extraction positions it among efforts seeking not just fairness improvements but also transparency in how models encode sensitive attributes.

Claimed Contributions

Applying SAEs to clinical notes to reveal model associations between race and stigmatizing concepts

The authors apply Sparse Autoencoders to clinical notes and demonstrate that SAE latents can uncover problematic associations that LLMs have learned between patient race and stigmatizing concepts. This represents one of the first assessments of SAEs for LLMs in clinical applications.

1 retrieved paper
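The latent-identification step described in this contribution can be sketched as ranking SAE latents by how much more strongly they fire on race-related probe text than on neutral controls. The snippet below is a toy illustration with random stand-in weights and hypothetical activation matrices (`probe_acts`, `baseline_acts`); a real analysis would use gemma-2 residual-stream activations and a pretrained SAE.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_sae = 16, 64  # toy dimensions; real SAEs are far wider

# Random stand-ins for a trained SAE's encoder weights and bias.
W_enc = rng.normal(size=(d_model, d_sae))
b_enc = np.zeros(d_sae)

def sae_encode(resid):
    """ReLU encoder: map residual-stream activations to sparse latent codes."""
    return np.maximum(resid @ W_enc + b_enc, 0.0)

# Hypothetical activations: tokens mentioning the target group vs. controls.
probe_acts = rng.normal(size=(8, d_model))     # e.g. "African American", ...
baseline_acts = rng.normal(size=(8, d_model))  # demographically neutral text

# Rank latents by how much more they fire on probe tokens than on controls.
score = sae_encode(probe_acts).mean(axis=0) - sae_encode(baseline_acts).mean(axis=0)
candidate_latents = np.argsort(score)[::-1][:5]
```

Candidate latents found this way would then be inspected manually (e.g., via their top-activating dataset examples) to check whether they also fire on stigmatizing terms such as "incarceration".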
Establishing causality via model steering and showing CoT unfaithfulness

The authors use SAE latents to causally steer model behavior, demonstrating that activating race-related latents changes clinical predictions. They further show that chain-of-thought reasoning does not reveal this reliance on race, establishing that CoT explanations are unfaithful in this context.

5 retrieved papers
Can Refute
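The steering intervention described here can be sketched as adding a scaled copy of the chosen latent's decoder direction to the residual stream at every token position. This is a minimal sketch with random stand-in decoder weights; `latent_idx` and `alpha` are hypothetical choices, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
d_model, d_sae = 16, 64
W_dec = rng.normal(size=(d_sae, d_model))  # stand-in SAE decoder weights

def steer(resid, latent_idx, alpha):
    """Shift every position along the latent's (unit-norm) decoder direction."""
    direction = W_dec[latent_idx] / np.linalg.norm(W_dec[latent_idx])
    return resid + alpha * direction  # broadcasts over token positions

resid = rng.normal(size=(4, d_model))           # 4 token positions
steered = steer(resid, latent_idx=7, alpha=5.0)
```

In a real pipeline this shift would be applied as a hook on a chosen layer's residual stream during generation; the sign and magnitude of `alpha` control whether the concept is amplified or suppressed, which is what makes the race manipulation causal rather than correlational.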
Assessing SAE-based bias detection and mitigation in clinical tasks

The authors evaluate whether ablating race-related SAE latents can mitigate bias in clinical generation tasks. They find that this approach works for simple tasks but has limited utility for more realistic and complex clinical applications.

10 retrieved papers
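The ablation evaluated in this contribution can be sketched as subtracting a single latent's contribution to the SAE reconstruction from the residual stream, leaving positions where the latent is inactive untouched. Again a toy sketch with random stand-in weights (bias terms, which a real SAE would include, are omitted for brevity).

```python
import numpy as np

rng = np.random.default_rng(2)
d_model, d_sae = 16, 64
W_enc = rng.normal(size=(d_model, d_sae))
W_dec = rng.normal(size=(d_sae, d_model))

def ablate_latent(resid, latent_idx):
    """Remove one latent's reconstruction term: resid - z_i * w_dec_i."""
    z = np.maximum(resid @ W_enc, 0.0)  # sparse codes (ReLU encoder)
    return resid - np.outer(z[:, latent_idx], W_dec[latent_idx])

resid = rng.normal(size=(4, d_model))
ablated = ablate_latent(resid, latent_idx=3)
```

Because the edit is local to one latent, it is attractive for targeted debiasing; the paper's finding that this helps on simple tasks but not complex clinical ones suggests race information is not confined to a single latent in realistic settings.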

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution
