Abstract:

Hateful memes have emerged as a particularly challenging form of online abuse, motivating the development of automated detection systems. Most prior approaches rely on direct detection, producing only binary predictions; such models fail to provide the context and explanations that real-world moderation requires. Recent Explain-then-Detect approaches, using Chain-of-Thought prompting or LMM agents, perform worse than simple SFT baselines, and even advanced post-training methods such as GRPO fail to close the gap. Our analysis identifies two key issues in such systems: important policy-relevant cues, such as targets and attack types, are rarely hypothesized by the model as likely explanations; and the binary reward signal is insufficient to guide reasoning. To address these challenges, we propose ExPO-HM (Explain-then-Detect Policy Optimization for Hateful Memes), inspired by the training and evaluation process of human annotators. ExPO-HM combines SFT warmup, GRPO with curriculum learning, and Conditional Decision Entropy (CDE) as both a metric and a reward for reasoning quality. Across three hateful meme benchmarks, ExPO-HM achieves state-of-the-art performance on binary detection, fine-grained classification, and reasoning quality, with up to 15% and 17% F1 improvements over the GRPO and DPO baselines, respectively. By moving hateful meme detection from simple binary alarms to explanation-driven detection, ExPO-HM provides accurate, interpretable, and actionable moderation support.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholarly search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces ExPO-HM, a framework combining supervised fine-tuning warmup, GRPO with curriculum learning, and Conditional Decision Entropy (CDE) for explainable hateful meme detection. It resides in the Chain-of-Thought and Multi-Step Reasoning leaf, which contains five papers total including the original work. This leaf sits within the broader Reasoning-Enhanced Detection Frameworks branch, indicating a moderately populated research direction focused on sequential inference and interpretable classification pathways rather than direct end-to-end detection.

The taxonomy reveals neighboring leaves addressing related reasoning paradigms: Multi-Agent Reasoning and Debate explores argumentation-based classification, Rationale Distillation transfers reasoning knowledge to smaller models, and Evolutionary and Contextual Reasoning models cultural progression. These sibling branches share the goal of interpretable detection but diverge in mechanism—ExPO-HM emphasizes policy optimization over chain-of-thought, whereas debate methods use multi-agent conflict resolution. The broader Explainability and Interpretability Methods branch focuses on post-hoc justifications rather than reasoning-guided classification, highlighting ExPO-HM's positioning at the intersection of reasoning and explainability.

Among thirty candidates examined, the ExPO-HM framework contribution shows one refutable candidate out of ten examined, suggesting some prior work in explain-then-detect architectures. The Conditional Decision Entropy metric examined ten candidates with none refutable, indicating potential novelty in using entropy-based rewards for reasoning quality. The evaluation framework contribution also examined ten candidates without refutations, though comprehensive benchmarking is common in this field. The limited search scope means these statistics reflect top-K semantic matches rather than exhaustive coverage, and the single refutation for the core framework warrants closer inspection of overlapping prior methods.

Based on the thirty candidates examined, the work appears to occupy a moderately explored space within reasoning-enhanced detection, with the CDE metric showing stronger novelty signals than the overall framework architecture. The taxonomy structure confirms this is an active research direction with multiple competing approaches, though not as densely populated as end-to-end classification methods. The analysis captures semantic proximity but cannot rule out relevant work outside the top-K retrieval scope or in adjacent communities.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Paper: 1

Research Landscape Overview

Core task: explainable hateful meme detection with reasoning-guided classification. The field has evolved into a rich landscape organized around several complementary directions. Reasoning-Enhanced Detection Frameworks emphasize multi-step inference and chain-of-thought processes to unpack the implicit meanings in memes, often leveraging large language models to generate intermediate rationales before classification. Explainability and Interpretability Methods focus on producing human-understandable justifications, whether through attention visualization, rationale generation, or post-hoc explanations, to clarify why a meme is flagged as hateful. Knowledge-Augmented Detection integrates external commonsense or cultural knowledge bases to capture nuances that pure vision-language models might miss. End-to-End Multimodal Classification pursues tightly integrated architectures that fuse text and image features without explicit reasoning steps, prioritizing efficiency and scalability. Specialized Detection Tasks and Settings address domain-specific challenges such as misogyny detection or cross-domain generalization, while Broader Multimodal Content Analysis extends techniques to related problems like misinformation or general meme understanding. Survey and Review Literature synthesizes these threads, documenting datasets, benchmarks, and emerging trends.

Within this ecosystem, a particularly active line of work explores how to decompose the detection process into interpretable stages. For instance, Decoupled Understanding[5] and MemHateCaptioning[27] investigate separating visual and textual reasoning before final classification, aiming to make each decision step transparent. ExPO-HM[0] sits squarely in the Reasoning-Enhanced Detection branch, emphasizing chain-of-thought and multi-step reasoning to guide classification with explicit intermediate explanations.
Compared to approaches like Demystifying Hateful Content[3], which may focus more on post-hoc interpretability, ExPO-HM[0] integrates reasoning directly into the detection pipeline, producing rationales that inform the final label. This contrasts with purely end-to-end methods that optimize for accuracy without surfacing intermediate logic. The central tension across these branches remains balancing model transparency with detection performance, and ExPO-HM[0] addresses this by embedding reasoning as a core architectural component rather than an auxiliary output.

Claimed Contributions

ExPO-HM framework for Explain-then-Detect hateful meme detection

The authors introduce ExPO-HM, a framework that combines SFT warmup on policy manuals, GRPO with curriculum learning, and Conditional Decision Entropy rewards to enable hateful meme detection systems that generate explanations before making predictions, mimicking how human moderators are trained.

10 retrieved papers (1 refutable)
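The staged pipeline described above (SFT warmup, then GRPO over a curriculum) implies an easy-to-hard scheduler. The sketch below is a minimal, illustrative batching step only; the `difficulty` scoring function is an assumption, since the paper's actual curriculum criterion is not specified in this report.

```python
def curriculum_batches(examples, difficulty, batch_size):
    """Order training examples easy-to-hard by a difficulty score,
    then split them into batches for policy-optimization updates.
    `difficulty` is an assumed callable returning a sortable score."""
    ordered = sorted(examples, key=difficulty)
    return [ordered[i:i + batch_size]
            for i in range(0, len(ordered), batch_size)]

# Illustrative use: shorter meme text treated as "easier" (an assumption).
memes = ["a long sarcastic caption", "short", "mid-length text"]
batches = curriculum_batches(memes, difficulty=len, batch_size=2)
```

Any monotone difficulty proxy (caption length, annotator disagreement, model loss) slots into the same interface.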
Conditional Decision Entropy (CDE) metric and reward

The authors propose CDE, which measures the entropy of a model's decision conditioned on its generated explanation. CDE serves dual purposes: evaluating reasoning quality and providing a reward signal during training to encourage confident correct predictions while penalizing confident errors.

10 retrieved papers (none refutable)
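As described, CDE is the entropy of the model's decision distribution conditioned on its generated explanation. A minimal sketch for the binary case follows; the sign convention in `cde_reward` is an illustrative assumption, not the paper's exact reward formulation.

```python
import math

def conditional_decision_entropy(p_hateful):
    """Binary entropy H(decision | explanation), given the probability the
    model assigns to the 'hateful' label after reading its own explanation."""
    eps = 1e-12  # guard against log(0)
    p = min(max(p_hateful, eps), 1 - eps)
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def cde_reward(p_hateful, is_hateful):
    """Illustrative shaping: low entropy (high confidence) is rewarded
    when the decision is correct and penalized when it is wrong."""
    confidence = 1.0 - conditional_decision_entropy(p_hateful)  # in [0, 1]
    correct = (p_hateful >= 0.5) == is_hateful
    return confidence if correct else -confidence
```

Under this convention, a confidently correct decision (e.g. p = 0.99 on a hateful meme) earns a reward near +1, while the same confidence on the wrong label costs nearly 1, matching the stated goal of penalizing confident errors.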
Comprehensive evaluation framework for hateful meme detection

The authors establish an evaluation framework that assesses models not only on binary hateful versus benign classification but also on fine-grained categories such as attack types and target groups, plus reasoning quality judged by LLMs, better reflecting real-world moderation needs.

10 retrieved papers (none refutable)
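Fine-grained evaluation over attack types and target groups is typically reported as macro-averaged F1 across classes. A dependency-free sketch is below; the example target-group labels are hypothetical, not taken from the benchmarks.

```python
def per_class_f1(y_true, y_pred, label):
    """F1 for one class, from true-positive/false-positive/false-negative counts."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
    fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all labels seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    return sum(per_class_f1(y_true, y_pred, l) for l in labels) / len(labels)

# Hypothetical target-group labels for illustration.
gold = ["women", "religion", "women", "none"]
pred = ["women", "women", "women", "none"]
score = macro_f1(gold, pred)
```

Macro averaging weights rare classes (e.g. an infrequent attack type) equally with common ones, which is why it is the usual choice for imbalanced fine-grained moderation labels.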

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

ExPO-HM framework for Explain-then-Detect hateful meme detection

The authors introduce ExPO-HM, a framework that combines SFT warmup on policy manuals, GRPO with curriculum learning, and Conditional Decision Entropy rewards to enable hateful meme detection systems that generate explanations before making predictions, mimicking how human moderators are trained.

Contribution

Conditional Decision Entropy (CDE) metric and reward

The authors propose CDE, which measures the entropy of a model's decision conditioned on its generated explanation. CDE serves dual purposes: evaluating reasoning quality and providing a reward signal during training to encourage confident correct predictions while penalizing confident errors.

Contribution

Comprehensive evaluation framework for hateful meme detection

The authors establish an evaluation framework that assesses models not only on binary hateful versus benign classification but also on fine-grained categories such as attack types and target groups, plus reasoning quality judged by LLMs, better reflecting real-world moderation needs.