Abstract:

Hateful memes have emerged as a particularly challenging form of online abuse, motivating the development of automated detection systems. Most prior approaches rely on direct detection, producing only binary predictions; such models fail to provide the context and explanations that real-world moderation requires. Recent Explain-then-Detect approaches, using Chain-of-Thought prompting or LMM agents, perform worse than simple SFT baselines, and even advanced post-training methods such as GRPO fail to close the gap. Our analysis identifies two key issues in such systems: important policy-relevant cues, such as targets and attack types, are rarely hypothesized by the model as likely explanations; and the binary reward signal is insufficient to guide reasoning. To address these challenges, we propose ExPO-HM (Explain-then-Detect Policy Optimization for Hateful Memes), inspired by the training and evaluation process of human annotators. ExPO-HM combines SFT warmup, GRPO with curriculum learning, and Conditional Decision Entropy (CDE) as both a metric and a reward for reasoning quality. Across three hateful meme benchmarks, ExPO-HM achieves state-of-the-art performance on binary detection, fine-grained classification, and reasoning quality, with up to 15% and 17% F1 improvements over the GRPO and DPO baselines, respectively. By moving hateful meme detection from simple binary alarms to explanation-driven detection, ExPO-HM provides accurate, interpretable, and actionable moderation support.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholarly search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces ExPO-HM, a framework combining supervised fine-tuning warmup, GRPO with curriculum learning, and Conditional Decision Entropy (CDE) for explainable hateful meme detection. It resides in the Chain-of-Thought and Multi-Step Reasoning leaf, which contains five papers total including the original work. This leaf sits within the broader Reasoning-Enhanced Detection Frameworks branch, indicating a moderately populated research direction focused on sequential inference and interpretable classification pathways rather than direct end-to-end detection.

The taxonomy reveals neighboring leaves addressing related reasoning paradigms: Multi-Agent Reasoning and Debate explores argumentation-based classification, Rationale Distillation transfers reasoning knowledge to smaller models, and Evolutionary and Contextual Reasoning models cultural progression. These sibling branches share the goal of interpretable detection but diverge in mechanism—ExPO-HM emphasizes policy optimization over chain-of-thought, whereas debate methods use multi-agent conflict resolution. The broader Explainability and Interpretability Methods branch focuses on post-hoc justifications rather than reasoning-guided classification, highlighting ExPO-HM's positioning at the intersection of reasoning and explainability.

Among thirty candidates examined, the ExPO-HM framework contribution shows one refutable candidate out of ten examined, suggesting some prior work in explain-then-detect architectures. The Conditional Decision Entropy metric examined ten candidates with none refutable, indicating potential novelty in using entropy-based rewards for reasoning quality. The evaluation framework contribution also examined ten candidates without refutations, though comprehensive benchmarking is common in this field. The limited search scope means these statistics reflect top-K semantic matches rather than exhaustive coverage, and the single refutation for the core framework warrants closer inspection of overlapping prior methods.

Based on the thirty candidates examined, the work appears to occupy a moderately explored space within reasoning-enhanced detection, with the CDE metric showing stronger novelty signals than the overall framework architecture. The taxonomy structure confirms this is an active research direction with multiple competing approaches, though not as densely populated as end-to-end classification methods. The analysis captures semantic proximity but cannot rule out relevant work outside the top-K retrieval scope or in adjacent communities.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Paper: 1

Research Landscape Overview

Core task: explainable hateful meme detection with reasoning-guided classification. The field has evolved into a rich landscape organized around several complementary directions. Reasoning-Enhanced Detection Frameworks emphasize multi-step inference and chain-of-thought processes to unpack the implicit meanings in memes, often leveraging large language models to generate intermediate rationales before classification. Explainability and Interpretability Methods focus on producing human-understandable justifications, whether through attention visualization, rationale generation, or post-hoc explanations, to clarify why a meme is flagged as hateful. Knowledge-Augmented Detection integrates external commonsense or cultural knowledge bases to capture nuances that pure vision-language models might miss. End-to-End Multimodal Classification pursues tightly integrated architectures that fuse text and image features without explicit reasoning steps, prioritizing efficiency and scalability. Specialized Detection Tasks and Settings address domain-specific challenges such as misogyny detection or cross-domain generalization, while Broader Multimodal Content Analysis extends techniques to related problems like misinformation or general meme understanding. Survey and Review Literature synthesizes these threads, documenting datasets, benchmarks, and emerging trends.

Within this ecosystem, a particularly active line of work explores how to decompose the detection process into interpretable stages. For instance, Decoupled Understanding[5] and MemHateCaptioning[27] investigate separating visual and textual reasoning before final classification, aiming to make each decision step transparent. ExPO-HM[0] sits squarely in the Reasoning-Enhanced Detection branch, emphasizing chain-of-thought and multi-step reasoning to guide classification with explicit intermediate explanations.
Compared to approaches like Demystifying Hateful Content[3], which may focus more on post-hoc interpretability, ExPO-HM[0] integrates reasoning directly into the detection pipeline, producing rationales that inform the final label. This contrasts with purely end-to-end methods that optimize for accuracy without surfacing intermediate logic. The central tension across these branches remains balancing model transparency with detection performance, and ExPO-HM[0] addresses this by embedding reasoning as a core architectural component rather than an auxiliary output.

Claimed Contributions

ExPO-HM framework for Explain-then-Detect hateful meme detection

The authors introduce ExPO-HM, a framework that combines SFT warmup on policy manuals, GRPO with curriculum learning, and Conditional Decision Entropy rewards to enable hateful meme detection systems that generate explanations before making predictions, mimicking how human moderators are trained.

10 retrieved papers (1 refutable)
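The staged pipeline described above (SFT warmup, then GRPO over a curriculum) implies an easy-to-hard scheduler. The sketch below is a minimal, illustrative batching step only; the `difficulty` scoring function is an assumption, since the paper's actual curriculum criterion is not specified in this report.

```python
def curriculum_batches(examples, difficulty, batch_size):
    """Order training examples easy-to-hard by a difficulty score,
    then split them into batches for policy-optimization updates.
    `difficulty` is an assumed callable returning a sortable score."""
    ordered = sorted(examples, key=difficulty)
    return [ordered[i:i + batch_size]
            for i in range(0, len(ordered), batch_size)]

# Illustrative use: shorter meme text treated as "easier" (an assumption).
memes = ["a long sarcastic caption", "short", "mid-length text"]
batches = curriculum_batches(memes, difficulty=len, batch_size=2)
```

Any monotone difficulty proxy (caption length, annotator disagreement, model loss) slots into the same interface.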
Conditional Decision Entropy (CDE) metric and reward

The authors propose CDE, which measures the entropy of a model's decision conditioned on its generated explanation. CDE serves dual purposes: evaluating reasoning quality and providing a reward signal during training to encourage confident correct predictions while penalizing confident errors.

10 retrieved papers (none refutable)
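As described, CDE is the entropy of the model's decision distribution conditioned on its generated explanation. A minimal sketch for the binary case follows; the sign convention in `cde_reward` is an illustrative assumption, not the paper's exact reward formulation.

```python
import math

def conditional_decision_entropy(p_hateful):
    """Binary entropy H(decision | explanation), given the probability the
    model assigns to the 'hateful' label after reading its own explanation."""
    eps = 1e-12  # guard against log(0)
    p = min(max(p_hateful, eps), 1 - eps)
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def cde_reward(p_hateful, is_hateful):
    """Illustrative shaping: low entropy (high confidence) is rewarded
    when the decision is correct and penalized when it is wrong."""
    confidence = 1.0 - conditional_decision_entropy(p_hateful)  # in [0, 1]
    correct = (p_hateful >= 0.5) == is_hateful
    return confidence if correct else -confidence
```

Under this convention, a confidently correct decision (e.g. p = 0.99 on a hateful meme) earns a reward near +1, while the same confidence on the wrong label costs nearly 1, matching the stated goal of penalizing confident errors.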
Comprehensive evaluation framework for hateful meme detection

The authors establish an evaluation framework that assesses models not only on binary hateful versus benign classification but also on fine-grained categories such as attack types and target groups, plus reasoning quality judged by LLMs, better reflecting real-world moderation needs.

10 retrieved papers (none refutable)
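Fine-grained evaluation over attack types and target groups is typically reported as macro-averaged F1 across classes. A dependency-free sketch is below; the example target-group labels are hypothetical, not taken from the benchmarks.

```python
def per_class_f1(y_true, y_pred, label):
    """F1 for one class, from true-positive/false-positive/false-negative counts."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
    fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all labels seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    return sum(per_class_f1(y_true, y_pred, l) for l in labels) / len(labels)

# Hypothetical target-group labels for illustration.
gold = ["women", "religion", "women", "none"]
pred = ["women", "women", "women", "none"]
score = macro_f1(gold, pred)
```

Macro averaging weights rare classes (e.g. an infrequent attack type) equally with common ones, which is why it is the usual choice for imbalanced fine-grained moderation labels.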

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

ExPO-HM framework for Explain-then-Detect hateful meme detection

The authors introduce ExPO-HM, a framework that combines SFT warmup on policy manuals, GRPO with curriculum learning, and Conditional Decision Entropy rewards to enable hateful meme detection systems that generate explanations before making predictions, mimicking how human moderators are trained.

Contribution

Conditional Decision Entropy (CDE) metric and reward

The authors propose CDE, which measures the entropy of a model's decision conditioned on its generated explanation. CDE serves dual purposes: evaluating reasoning quality and providing a reward signal during training to encourage confident correct predictions while penalizing confident errors.

Contribution

Comprehensive evaluation framework for hateful meme detection

The authors establish an evaluation framework that assesses models not only on binary hateful versus benign classification but also on fine-grained categories such as attack types and target groups, plus reasoning quality judged by LLMs, better reflecting real-world moderation needs.