Can SAEs reveal and mitigate racial biases of LLMs in healthcare?

ICLR 2026 Conference Submission, Anonymous Authors
Keywords: clinical natural language processing, mechanistic interpretability
Abstract:

LLMs are increasingly being used in healthcare. This promises to free physicians from drudgery, enabling better care to be delivered at scale. But the use of LLMs in this space also brings risks; for example, such models may worsen existing biases. How can we spot when LLMs are (spuriously) relying on patient race to inform predictions? In this work we assess the degree to which Sparse Autoencoders (SAEs) can reveal (and control) associations the model has made between race and stigmatizing concepts.

We first identify an SAE latent in gemma-2 models that appears to correlate with Black individuals. We find that this latent activates on reasonable input sequences (e.g., "African American") but also on problematic words like "incarceration". We then show that we can use this latent to "steer" models to generate outputs about Black patients, and that doing so induces problematic associations in model outputs. For example, activating the "Black" latent increases the predicted probability that a patient will become "belligerent". We also find that even in this controlled setting, in which we causally intervene to manipulate only patient race, elicited CoT reasoning strings do not communicate that race is a factor in the resulting assessments. Finally, we evaluate the degree to which such "steering" via latents might be useful for mitigating bias. We find that it offers improvements in simple settings, but is less successful for more realistic and complex clinical tasks.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholarly search engine). It analyzes a paper's claimed tasks and contributions against retrieved prior work. While this system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper applies Sparse Autoencoders to identify and manipulate race-associated latents in Gemma-2 models, demonstrating how these features correlate with stigmatizing concepts in clinical contexts. It resides in the Healthcare-Specific Debiasing and Fairness Interventions leaf, which contains only three papers total. This represents a relatively sparse research direction within the broader taxonomy of 50 papers across bias detection, mitigation, clinical applications, and foundational frameworks. The small sibling set suggests this mechanistic interpretability approach to healthcare bias remains underexplored compared to detection-focused or domain-specific bias studies.

The taxonomy reveals substantial activity in adjacent areas: Bias Detection and Measurement contains 23 papers across benchmarking, clinical task characterization, and EHR documentation analysis, while Clinical Applications houses 11 papers examining domain-specific manifestations. The original paper bridges these branches by using interpretability tools (SAEs) not merely to detect bias but to causally intervene and assess mitigation potential. Neighboring work like general debiasing techniques focuses on prompt engineering or data augmentation, whereas this paper targets internal model representations. The scope note for this leaf emphasizes healthcare-specific adaptations, distinguishing it from both generic fairness methods and clinical prediction model adjustments.

Among 16 candidates examined, contribution-level analysis shows mixed novelty signals. The SAE application to clinical notes (Contribution 1) examined one candidate with no clear refutation. The causal steering and chain-of-thought unfaithfulness analysis (Contribution 2) examined five candidates, with one appearing to provide overlapping prior work on model steering or reasoning transparency. The SAE-based bias detection and mitigation assessment (Contribution 3) examined ten candidates without clear refutation. These statistics reflect a limited search scope—top-K semantic matches plus citation expansion—not an exhaustive literature review. The steering contribution shows more substantial prior work overlap within this constrained sample.

Given the limited search scale and sparse taxonomy leaf, the work appears to occupy a relatively novel position in applying mechanistic interpretability to healthcare bias mitigation. The SAE-based approach differentiates it from sibling papers, though the steering methodology shows some overlap with existing interpretability research. The analysis covers semantic neighbors and direct citations but cannot rule out relevant work outside the top-16 candidates examined. The sparse leaf population suggests either emerging interest or underrepresentation of interpretability-driven debiasing in current healthcare NLP literature.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 16
Refutable Paper: 1

Research Landscape Overview

Core task: revealing and mitigating racial biases in healthcare language models. The field has organized itself around four main branches that reflect both the diagnostic and therapeutic sides of bias research. Bias Detection and Measurement in Healthcare LLMs encompasses foundational work on identifying disparities in model outputs, often through systematic audits of demographic performance gaps (e.g., Demographic Disparities Medical LLMs[2], Sociodemographic Biases Medical LLMs[5]). Clinical Applications and Domain-Specific Bias Studies examines how biases manifest in particular medical contexts—from emergency triage (Missingness Racial Disparities Triage[12]) to psychiatric assessments (Racial Bias Psychiatric LLMs[18])—revealing that bias patterns vary substantially across specialties. Foundational Bias Research and Methodological Frameworks provides the conceptual scaffolding, drawing on broader fairness literature (Social Biases Language Models[1]) and establishing evaluation protocols. Finally, Bias Mitigation and Fairness Enhancement Strategies focuses on interventions, ranging from data augmentation and prompt engineering to architectural modifications that aim to reduce disparate outcomes.

Within the mitigation branch, a particularly active line of work explores healthcare-specific debiasing techniques that go beyond generic fairness methods. SAEs Racial Bias Healthcare[0] exemplifies this direction by using sparse autoencoders to interpret and intervene on racial bias representations within clinical language models, offering a mechanistic lens on how biases encode themselves in model internals. This approach contrasts with neighboring studies like Healthcare Racial LGBTQ Biases[8], which documents intersectional disparities across multiple demographic axes, and Linguistic Bias BERT Medical[25], which examines how pre-trained embeddings inherit biased associations from clinical corpora.
A central tension across these works is whether mitigation should target model architectures, training data composition, or deployment-time guardrails—each strategy trading off between interpretability, scalability, and the risk of obscuring rather than eliminating bias. The original paper's focus on interpretable debiasing through sparse feature extraction positions it among efforts seeking not just fairness improvements but also transparency in how models encode sensitive attributes.

Claimed Contributions

Applying SAEs to clinical notes to reveal model associations between race and stigmatizing concepts

The authors apply Sparse Autoencoders to clinical notes and demonstrate that SAE latents can uncover problematic associations that LLMs have learned between patient race and stigmatizing concepts. This represents one of the first assessments of SAEs for LLMs in clinical applications.

1 retrieved paper
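The latent-identification step described in this contribution can be sketched as ranking SAE latents by how much more strongly they fire on race-related probe text than on neutral controls. The snippet below is a toy illustration with random stand-in weights and hypothetical activation matrices (`probe_acts`, `baseline_acts`); a real analysis would use gemma-2 residual-stream activations and a pretrained SAE.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_sae = 16, 64  # toy dimensions; real SAEs are far wider

# Random stand-ins for a trained SAE's encoder weights and bias.
W_enc = rng.normal(size=(d_model, d_sae))
b_enc = np.zeros(d_sae)

def sae_encode(resid):
    """ReLU encoder: map residual-stream activations to sparse latent codes."""
    return np.maximum(resid @ W_enc + b_enc, 0.0)

# Hypothetical activations: tokens mentioning the target group vs. controls.
probe_acts = rng.normal(size=(8, d_model))     # e.g. "African American", ...
baseline_acts = rng.normal(size=(8, d_model))  # demographically neutral text

# Rank latents by how much more they fire on probe tokens than on controls.
score = sae_encode(probe_acts).mean(axis=0) - sae_encode(baseline_acts).mean(axis=0)
candidate_latents = np.argsort(score)[::-1][:5]
```

Candidate latents found this way would then be inspected manually (e.g., via their top-activating dataset examples) to check whether they also fire on stigmatizing terms such as "incarceration".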
Establishing causality via model steering and showing CoT unfaithfulness

The authors use SAE latents to causally steer model behavior, demonstrating that activating race-related latents changes clinical predictions. They further show that chain-of-thought reasoning does not reveal this reliance on race, establishing that CoT explanations are unfaithful in this context.

5 retrieved papers
Can Refute
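The steering intervention described here can be sketched as adding a scaled copy of the chosen latent's decoder direction to the residual stream at every token position. This is a minimal sketch with random stand-in decoder weights; `latent_idx` and `alpha` are hypothetical choices, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
d_model, d_sae = 16, 64
W_dec = rng.normal(size=(d_sae, d_model))  # stand-in SAE decoder weights

def steer(resid, latent_idx, alpha):
    """Shift every position along the latent's (unit-norm) decoder direction."""
    direction = W_dec[latent_idx] / np.linalg.norm(W_dec[latent_idx])
    return resid + alpha * direction  # broadcasts over token positions

resid = rng.normal(size=(4, d_model))           # 4 token positions
steered = steer(resid, latent_idx=7, alpha=5.0)
```

In a real pipeline this shift would be applied as a hook on a chosen layer's residual stream during generation; the sign and magnitude of `alpha` control whether the concept is amplified or suppressed, which is what makes the race manipulation causal rather than correlational.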
Assessing SAE-based bias detection and mitigation in clinical tasks

The authors evaluate whether ablating race-related SAE latents can mitigate bias in clinical generation tasks. They find that this approach works for simple tasks but has limited utility for more realistic and complex clinical applications.

10 retrieved papers
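The ablation evaluated in this contribution can be sketched as subtracting a single latent's contribution to the SAE reconstruction from the residual stream, leaving positions where the latent is inactive untouched. Again a toy sketch with random stand-in weights (bias terms, which a real SAE would include, are omitted for brevity).

```python
import numpy as np

rng = np.random.default_rng(2)
d_model, d_sae = 16, 64
W_enc = rng.normal(size=(d_model, d_sae))
W_dec = rng.normal(size=(d_sae, d_model))

def ablate_latent(resid, latent_idx):
    """Remove one latent's reconstruction term: resid - z_i * w_dec_i."""
    z = np.maximum(resid @ W_enc, 0.0)  # sparse codes (ReLU encoder)
    return resid - np.outer(z[:, latent_idx], W_dec[latent_idx])

resid = rng.normal(size=(4, d_model))
ablated = ablate_latent(resid, latent_idx=3)
```

Because the edit is local to one latent, it is attractive for targeted debiasing; the paper's finding that this helps on simple tasks but not complex clinical ones suggests race information is not confined to a single latent in realistic settings.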

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution
