Abstract:

Natural language processing (NLP) models often replicate or amplify social bias from training data, raising concerns about fairness. At the same time, their black-box nature makes it difficult for users to recognize biased predictions and for developers to effectively mitigate them. While some studies suggest that input-based explanations can help detect and mitigate bias, others question their reliability in ensuring fairness. Existing research on explainability in fair NLP has been predominantly qualitative, with limited large-scale quantitative analysis. In this work, we conduct the first systematic study of the relationship between explainability and fairness in hate speech detection, focusing on both encoder- and decoder-only models. We examine three key dimensions: (1) identifying biased predictions, (2) selecting fair models, and (3) mitigating bias during model training. Our findings show that input-based explanations can effectively detect biased predictions and serve as useful supervision for reducing bias during training, but they are unreliable for selecting fair models among candidates.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND ITS JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper presents a systematic quantitative study examining how explainability methods relate to fairness in hate speech detection, focusing on three applications: identifying biased predictions, selecting fair models, and mitigating bias during training. It resides in the 'Explainability for Fairness Assessment and Auditing' leaf, which contains five papers total. This leaf sits within the broader 'Integrated Explainability-Fairness Studies' branch, indicating a moderately populated research direction that explicitly bridges transparency and equity concerns rather than treating them separately.

The taxonomy reveals neighboring work in 'Fairness-Aware Explainable Model Design' (three papers on joint optimization) and separate branches for pure explainability methods (fourteen papers across post-hoc, rationale-based, and concept-based techniques) and pure fairness analysis (seven papers on bias detection and mitigation). The paper's leaf focuses specifically on using explanations as diagnostic tools for fairness auditing, distinguishing it from sibling work like systematic auditing frameworks or individual biased prediction detection. This positioning reflects the field's evolution toward recognizing explainability and fairness as mutually reinforcing rather than independent dimensions.

Among thirty candidates examined, none clearly refute the three core contributions. The systematic study of the explainability-fairness relationship (ten candidates examined, zero refutable) appears novel in its comprehensive quantitative scope across encoder and decoder models. The quantitative evaluation framework for three fairness-explainability applications (ten candidates, zero refutable) and the empirical findings on explanation effectiveness (ten candidates, zero refutable) similarly show no substantial prior overlap within the limited search. The absence of refutable candidates suggests these contributions occupy relatively unexplored territory, though the modest search scale leaves open the possibility of relevant work beyond the top-thirty semantic matches.

Based on the limited literature search, the work appears to make substantive contributions to an active but not overcrowded research area. The taxonomy structure shows the field has established separate traditions for explainability and fairness, with the integrated branch representing a newer synthesis. The analysis covers top-thirty semantic matches and does not claim exhaustive coverage of all potentially relevant prior work in adjacent domains or earlier publication venues.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 0

Research Landscape Overview

Core task: understanding the relationship between explainability and fairness in hate speech detection. The field has evolved into several interconnected branches that address different facets of this challenge. Explainability Methods and Frameworks for Hate Speech Detection focus on making model decisions transparent through techniques like rationale extraction and post-hoc interpretation, as seen in works such as Hatexplain Benchmark[2] and Explainable Offensive Classifier[3]. Fairness-Oriented Bias Analysis and Mitigation examines how models encode and perpetuate biases, with studies like Gender Biases Offensive[6] and Racial Dialect Bias[47] revealing systematic disparities. The Integrated Explainability-Fairness Studies branch explicitly bridges these concerns, using explanations to audit fairness properties, while Datasets, Benchmarks, and Evaluation Frameworks provide shared resources like COLD Benchmark[5] for rigorous assessment. Sociotechnical and Ethical Perspectives broaden the lens to consider stakeholder needs and human rights implications, and General Fairness and Explainability Toolkits offer reusable infrastructure across domains.

A particularly active line of work explores how explanations can serve as diagnostic tools for uncovering bias, with studies like Explanations Bias Detectors[16] and Biased Models Explanations[44] demonstrating that interpretability methods can reveal when models rely on protected attributes or spurious correlations. Bridging Fairness Explainability[0] sits squarely within this integrated cluster, emphasizing the use of explainability techniques to assess and improve fairness in hate speech classifiers. Compared to Fairshades Auditing[34], which provides systematic frameworks for bias evaluation, and Detecting Biased Predictions[46], which focuses on identifying individual biased outputs, the original work synthesizes these perspectives to show how transparency mechanisms can both diagnose fairness issues and guide mitigation strategies. This positioning reflects a growing recognition that explainability and fairness are not separate desiderata but mutually reinforcing dimensions of trustworthy content moderation systems.

Claimed Contributions

Systematic study of explainability and fairness relationship in hate speech detection

The authors present the first comprehensive quantitative analysis examining how input-based explanations relate to fairness in hate speech detection models. They investigate this relationship across three key dimensions: identifying biased predictions, selecting fair models, and mitigating bias during training.

10 retrieved papers
Quantitative evaluation framework for three fairness-explainability applications

The authors develop a systematic evaluation framework that quantitatively assesses input-based explanations across three distinct applications: detecting biased predictions at inference time, automatically selecting fair models from a pool of candidates, and providing supervision signals for bias mitigation during training. A minimal sketch of the first application is given below.

10 retrieved papers
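To make the first application concrete, here is a minimal sketch, assuming token-level attribution scores are already available (e.g., from gradient-x-input or SHAP): a "hateful" prediction is flagged as potentially biased when most of the attribution mass falls on identity terms rather than on the rest of the content. The IDENTITY_TERMS lexicon, the 0.5 threshold, and the function names are illustrative assumptions, not details taken from the paper.

```python
# Illustrative sketch: flagging potentially biased predictions from token attributions.
# Assumes per-token attribution scores have already been computed by some explainer.

IDENTITY_TERMS = {"muslim", "women", "black", "gay", "jewish"}  # toy lexicon (assumption)

def identity_attribution_share(tokens, attributions):
    """Fraction of absolute attribution mass assigned to identity terms."""
    total = sum(abs(a) for a in attributions) or 1e-12
    on_identity = sum(abs(a) for t, a in zip(tokens, attributions)
                      if t.lower() in IDENTITY_TERMS)
    return on_identity / total

def flag_biased_prediction(tokens, attributions, predicted_hateful, threshold=0.5):
    """Flag a 'hateful' prediction that leans mostly on identity terms."""
    return predicted_hateful and identity_attribution_share(tokens, attributions) > threshold

# Example: a prediction driven almost entirely by the word "muslim"
tokens = ["muslim", "people", "are", "welcome", "here"]
attributions = [0.90, 0.03, 0.02, 0.03, 0.02]
print(flag_biased_prediction(tokens, attributions, predicted_hateful=True))  # True -> review
```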
Empirical findings on effectiveness of explanations for fairness tasks

The authors provide empirical evidence demonstrating that input-based explanations are effective for identifying biased predictions and for mitigating bias through training regularization (a sketch of such regularization follows below), while showing that they are not reliable for model selection. They also show that these methods remain robust in explanation-debiased models and outperform LLM-based bias detection.

10 retrieved papers
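The mitigation finding corresponds to using explanations as a training-time supervision signal. Below is a minimal PyTorch sketch of one such scheme, assuming gradient-x-input attributions and a precomputed identity-token mask; the ToyClassifier, the mask construction, and the reg_lambda value are illustrative assumptions, not the paper's exact method.

```python
# Illustrative sketch of explanation-based bias mitigation during training:
# the usual classification loss is augmented with a penalty on the attribution
# mass that gradient-x-input assigns to identity tokens.

import torch
import torch.nn.functional as F

def explanation_regularized_loss(model, input_embeds, labels, identity_mask, reg_lambda=1.0):
    # model: maps token embeddings (batch, seq_len, dim) to class logits (batch, n_classes)
    # identity_mask: (batch, seq_len) float tensor, 1.0 at identity-term positions
    input_embeds = input_embeds.detach().requires_grad_(True)
    logits = model(input_embeds)
    task_loss = F.cross_entropy(logits, labels)

    # Gradient-x-input attribution per token, summed over the embedding dimension.
    grads = torch.autograd.grad(task_loss, input_embeds, create_graph=True)[0]
    attributions = (grads * input_embeds).sum(dim=-1).abs()  # (batch, seq_len)

    # Penalize the average attribution mass placed on identity tokens.
    reg = (attributions * identity_mask).sum() / identity_mask.sum().clamp_min(1.0)
    return task_loss + reg_lambda * reg

class ToyClassifier(torch.nn.Module):
    """Mean-pools token embeddings and classifies; stands in for a real encoder."""
    def __init__(self, dim=16, n_classes=2):
        super().__init__()
        self.fc = torch.nn.Linear(dim, n_classes)
    def forward(self, embeds):  # (batch, seq_len, dim)
        return self.fc(embeds.mean(dim=1))

# Toy usage: pretend position 0 of every sequence is an identity term.
model = ToyClassifier()
embeds = torch.randn(4, 10, 16)
labels = torch.tensor([0, 1, 0, 1])
mask = torch.zeros(4, 10)
mask[:, 0] = 1.0
loss = explanation_regularized_loss(model, embeds, labels, mask)
loss.backward()
```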
