EmotionHallucer: Evaluating Emotion Hallucinations in Multimodal Large Language Models

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: Emotion Hallucination, Emotion Understanding, Affective Computing
Abstract:

Emotion understanding is a critical yet challenging task. Recent advances in Multimodal Large Language Models (MLLMs) have significantly enhanced their capabilities in this area. However, MLLMs often suffer from "hallucinations", generating irrelevant or nonsensical content. To the best of our knowledge, and despite the importance of this issue, there has been no dedicated effort to evaluate emotion-related hallucinations in MLLMs. In this work, we introduce EmotionHallucer, the first benchmark for detecting and analyzing emotion hallucinations in MLLMs. Unlike humans, whose emotion understanding stems from the interplay of biology and social learning, MLLMs rely solely on data-driven learning and lack innate emotional instincts. Fortunately, emotion psychology provides a solid foundation of knowledge about human emotions. Building on this knowledge, we assess emotion hallucinations from two perspectives: emotion psychology knowledge and real-world multimodal perception. To support robust evaluation, we employ an adversarial binary question-answer (QA) framework that uses carefully crafted basic and hallucinated pairs to probe the emotion hallucination tendencies of MLLMs. By evaluating 41 LLMs and MLLMs on EmotionHallucer, we find that: (1) most current models exhibit substantial issues with emotion hallucinations; (2) closed-source models outperform open-source models in detecting emotion hallucinations, and reasoning capability provides additional advantages; and (3) existing models perform better in emotion psychology knowledge than in multimodal emotion perception. As a byproduct, these findings inspire us to propose the PEP-MEK framework, which yields an average improvement of 9.90% in emotion hallucination detection across selected models. Resources will be available on GitHub.
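
To make the adversarial QA protocol concrete, below is a minimal sketch of how a basic/hallucinated pair could be scored. The pairing rule shown (a model scores only when it affirms the basic question and rejects the hallucinated one) is a common convention for adversarial binary benchmarks and is an assumption here, as are the hypothetical `ask_model` callable and the yes/no answer format; neither is taken from the paper.

```python
# Minimal sketch of adversarial pair scoring (assumed convention, not the
# paper's exact protocol): a model is credited only when it answers BOTH
# questions of a pair correctly, so a blanket yes- or no-bias cannot
# inflate accuracy.
from typing import Callable, Iterable, Tuple

def evaluate_pair(ask_model: Callable[[str], str],
                  basic_q: str, hallucinated_q: str) -> bool:
    """True iff the model affirms the basic question and rejects the
    hallucinated counterpart. `ask_model` is a hypothetical callable
    that returns a yes/no string."""
    return (ask_model(basic_q).strip().lower() == "yes"
            and ask_model(hallucinated_q).strip().lower() == "no")

def pair_accuracy(ask_model: Callable[[str], str],
                  qa_pairs: Iterable[Tuple[str, str]]) -> float:
    """Fraction of (basic, hallucinated) pairs answered consistently."""
    pairs = list(qa_pairs)
    return sum(evaluate_pair(ask_model, b, h) for b, h in pairs) / len(pairs)
```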

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholarly search engine). It analyzes a paper's tasks and contributions against retrieved prior work. While the system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces EmotionHallucer, the first dedicated benchmark for detecting emotion-related hallucinations in multimodal large language models. Within the taxonomy, it occupies the 'Emotion Hallucination Benchmarking' leaf under 'Emotion Understanding and Hallucination Evaluation'. Notably, this leaf contains only the original paper itself—no sibling papers appear in the same category. This isolation suggests the work addresses a previously uncharted niche: while general hallucination detection methods exist in parallel branches, none specifically target emotion misattribution through adversarial psychology-grounded evaluation.

The taxonomy reveals neighboring research directions that contextualize this contribution. The parent branch includes human-centric video emotion assessment, cross-lingual bimodal emotion recognition, and appraisal theory-informed prediction—all addressing emotion understanding but without hallucination-focused evaluation frameworks. A sibling top-level branch covers general hallucination mitigation through representation alignment, preference optimization, and causal debiasing. EmotionHallucer bridges these domains by applying hallucination detection principles specifically to affective content, diverging from both generic multimodal benchmarks and pure emotion recognition systems that lack adversarial robustness testing.

Among thirty candidates examined across three contributions, none yielded refutable prior work. The EmotionHallucer benchmark itself (ten candidates examined, zero refutations) appears novel within this search scope, as does the comprehensive evaluation of forty-one models and the PEP-MEK mitigation framework (each ten candidates, zero refutations). This absence of overlapping work aligns with the taxonomy structure showing no sibling papers in the same leaf. However, the limited search scale means unexplored literature beyond top-thirty semantic matches could contain relevant emotion-hallucination studies not captured here.

Based on the examined candidates and taxonomy position, the work occupies a sparse research direction at the intersection of emotion understanding and hallucination evaluation. The lack of sibling papers and zero refutations across contributions suggest substantive novelty within the scope analyzed. Nonetheless, the thirty-candidate search represents a bounded exploration—broader surveys of affective computing or multimodal robustness literature might reveal adjacent efforts not surfaced by semantic similarity alone.

Taxonomy

Core-task Taxonomy Papers: 8
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 0

Research Landscape Overview

Core task: evaluating emotion hallucinations in multimodal large language models. The field structure reflects three main branches that together address how MLLMs perceive and reason about emotions while managing the risk of generating unfaithful outputs. The first branch, Hallucination Detection and Mitigation in MLLMs, encompasses general techniques for identifying and reducing hallucinations across modalities, including contrastive learning approaches like Hallucination Augmented Contrastive[1] and preference-based methods such as Hallucination-Aware DPO[2]. The second branch, Emotion Understanding and Hallucination Evaluation, focuses specifically on how models interpret affective states and whether their emotion-related outputs align with ground truth, drawing on works like Cross-Lingual Bimodal Emotion[3] and debiasing strategies exemplified by Counterfactual Debiasing MLLMs[4]. The third branch, Emotion-Informed Downstream Applications, explores how emotion recognition feeds into practical tasks, with systems like ETICD-Net[5] and Image Analyzer[6] leveraging affective cues for richer multimodal understanding.

A particularly active line of work examines the intersection of hallucination mitigation and emotion-specific evaluation, where researchers ask whether standard hallucination metrics capture the nuances of affective misattribution. EmotionHallucer[0] sits squarely within the Emotion Understanding and Hallucination Evaluation branch, specifically targeting emotion hallucination benchmarking. Unlike broader hallucination studies that treat all modalities uniformly, EmotionHallucer[0] zeroes in on the unique challenges of emotion perception, where subjective interpretation and cultural context can blur the line between genuine error and plausible variation. This contrasts with works like HumanVBench[8], which may assess general human-centric attributes, and complements theoretical frameworks such as Appraisal Theory Emotion[7] by grounding abstract emotion models in concrete evaluation protocols.

The main open question remains how to balance fine-grained emotion taxonomies with scalable, reproducible benchmarking across diverse multimodal architectures.

Claimed Contributions

EmotionHallucer benchmark for emotion hallucination evaluation

The authors present EmotionHallucer, the first dedicated benchmark to evaluate emotion-related hallucinations in multimodal large language models. It assesses hallucinations from two perspectives: emotion psychology knowledge (factuality hallucination) and real-world multimodal perception (faithfulness hallucination), using an adversarial binary question-answer framework across seven subcategories and four modalities.

Retrieved candidate papers: 10

Comprehensive evaluation of 41 LLMs and MLLMs with three key findings

The authors evaluate 41 large language models and multimodal large language models on EmotionHallucer, revealing three main findings: most models exhibit substantial emotion hallucination issues; closed-source models outperform open-source ones, with reasoning capability providing an additional advantage; and models perform better on emotion psychology knowledge than on multimodal emotion perception (an aggregation sketch follows below).

Retrieved candidate papers: 10
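
As an illustration of how finding (3) could be aggregated, the sketch below compares accuracy on emotion-psychology-knowledge items against multimodal-perception items. The record schema (the `perspective` and `correct` fields) is hypothetical and not the benchmark's actual data format.

```python
# Hedged sketch: per-perspective accuracy, mirroring the comparison between
# emotion psychology knowledge and multimodal emotion perception. The field
# names below are illustrative assumptions.
from collections import defaultdict

def accuracy_by_perspective(results):
    """results: iterable of dicts with 'perspective' in
    {'knowledge', 'perception'} and a boolean 'correct' flag."""
    totals, hits = defaultdict(int), defaultdict(int)
    for r in results:
        totals[r["perspective"]] += 1
        hits[r["perspective"]] += int(r["correct"])
    return {p: hits[p] / totals[p] for p in totals}

# Illustrative usage with made-up records (not real benchmark results):
demo = [
    {"perspective": "knowledge", "correct": True},
    {"perspective": "knowledge", "correct": True},
    {"perspective": "perception", "correct": True},
    {"perspective": "perception", "correct": False},
]
print(accuracy_by_perspective(demo))  # {'knowledge': 1.0, 'perception': 0.5}
```
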
PEP-MEK framework for mitigating emotion hallucinations

The authors propose the Predict-Explain-Predict with Modality and Emotion Knowledge (PEP-MEK) framework, a plug-and-play method that incorporates modality-specific and emotional knowledge to reduce emotion hallucinations. Experiments show an average improvement of 9.90% in emotion hallucination detection across the selected models; a sketch of the loop follows below.

Retrieved candidate papers: 10
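
The sketch below illustrates one plausible reading of a Predict-Explain-Predict loop with injected modality and emotion knowledge, in the spirit of PEP-MEK. The prompt wording, the `chat` callable, and the two knowledge notes are assumptions for illustration, not the authors' exact implementation.

```python
# Illustrative Predict-Explain-Predict loop (assumed structure, not the
# authors' exact prompts): answer, self-check against injected knowledge,
# then re-answer with the explanation in context.
from typing import Callable

def pep_mek_answer(chat: Callable[[str], str], question: str,
                   modality_note: str, emotion_note: str) -> str:
    # Step 1: initial prediction on the bare question.
    first = chat(f"{question}\nAnswer yes or no.")
    # Step 2: explanation conditioned on modality-specific and
    # emotion-psychology knowledge supplied by the caller.
    explanation = chat(
        f"Question: {question}\n"
        f"Initial answer: {first}\n"
        f"Modality knowledge: {modality_note}\n"
        f"Emotion knowledge: {emotion_note}\n"
        "Explain step by step whether the initial answer is consistent "
        "with this knowledge."
    )
    # Step 3: final prediction conditioned on the self-generated explanation.
    final = chat(
        f"Question: {question}\nExplanation: {explanation}\n"
        "Give your final answer: yes or no."
    )
    return final.strip().lower()
```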

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Within the taxonomy built over the current top-K core-task papers, the original paper is assigned to a leaf with no direct siblings and no cousin branches under the same grandparent topic. In this retrieved landscape it appears structurally isolated, a partial signal of novelty that remains constrained by search coverage and taxonomy granularity.

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

EmotionHallucer benchmark for emotion hallucination evaluation

Ten candidate papers were retrieved for comparison against this benchmark claim; none refuted its novelty within the search scope (see the summary under Claimed Contributions above).

Contribution

Comprehensive evaluation of 41 LLMs and MLLMs with three key findings

Ten candidate papers were retrieved for comparison against this evaluation claim; none refuted it within the search scope.

Contribution

PEP-MEK framework for mitigating emotion hallucinations

Ten candidate papers were retrieved for comparison against the PEP-MEK claim; none refuted its novelty within the search scope.