EmotionHallucer: Evaluating Emotion Hallucinations in Multimodal Large Language Models

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: Emotion Hallucination, Emotion Understanding, Affective Computing
Abstract:

Emotion understanding is a critical yet challenging task. Recent advances in Multimodal Large Language Models (MLLMs) have significantly enhanced their capabilities in this area. However, MLLMs often suffer from "hallucinations", generating irrelevant or nonsensical content. To the best of our knowledge, and despite the importance of this issue, there has been no dedicated effort to evaluate emotion-related hallucinations in MLLMs. In this work, we introduce EmotionHallucer, the first benchmark for detecting and analyzing emotion hallucinations in MLLMs. Unlike humans, whose emotion understanding stems from the interplay of biology and social learning, MLLMs rely solely on data-driven learning and lack innate emotional instincts. Fortunately, emotion psychology provides a solid foundation of knowledge about human emotions. Building on this knowledge, we assess emotion hallucinations from two perspectives: emotion psychology knowledge and real-world multimodal perception. To support robust evaluation, we employ an adversarial binary question-answer (QA) framework that uses carefully crafted basic and hallucinated pairs to probe the emotion hallucination tendencies of MLLMs. By evaluating 41 LLMs and MLLMs on EmotionHallucer, we find that: (1) most current models exhibit substantial issues with emotion hallucinations; (2) closed-source models outperform open-source models in detecting emotion hallucinations, and reasoning capability provides additional advantages; and (3) existing models perform better in emotion psychology knowledge than in multimodal emotion perception. As a byproduct, these findings inspire us to propose the PEP-MEK framework, which yields an average improvement of 9.90% in emotion hallucination detection across selected models. Resources will be available on GitHub.
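
To make the adversarial QA protocol concrete, below is a minimal sketch of how a basic/hallucinated pair could be scored. The pairing rule shown (a model scores only when it affirms the basic question and rejects the hallucinated one) is a common convention for adversarial binary benchmarks and is an assumption here, as are the hypothetical `ask_model` callable and the yes/no answer format; neither is taken from the paper.

```python
# Minimal sketch of adversarial pair scoring (assumed convention, not the
# paper's exact protocol): a model is credited only when it answers BOTH
# questions of a pair correctly, so a blanket yes- or no-bias cannot
# inflate accuracy.
from typing import Callable, Iterable, Tuple

def evaluate_pair(ask_model: Callable[[str], str],
                  basic_q: str, hallucinated_q: str) -> bool:
    """True iff the model affirms the basic question and rejects the
    hallucinated counterpart. `ask_model` is a hypothetical callable
    that returns a yes/no string."""
    return (ask_model(basic_q).strip().lower() == "yes"
            and ask_model(hallucinated_q).strip().lower() == "no")

def pair_accuracy(ask_model: Callable[[str], str],
                  qa_pairs: Iterable[Tuple[str, str]]) -> float:
    """Fraction of (basic, hallucinated) pairs answered consistently."""
    pairs = list(qa_pairs)
    return sum(evaluate_pair(ask_model, b, h) for b, h in pairs) / len(pairs)
```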

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholarly search engine). It analyzes a paper's tasks and contributions against retrieved prior work. While the system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces EmotionHallucer, the first dedicated benchmark for detecting emotion-related hallucinations in multimodal large language models. Within the taxonomy, it occupies the 'Emotion Hallucination Benchmarking' leaf under 'Emotion Understanding and Hallucination Evaluation'. Notably, this leaf contains only the original paper itself—no sibling papers appear in the same category. This isolation suggests the work addresses a previously uncharted niche: while general hallucination detection methods exist in parallel branches, none specifically target emotion misattribution through adversarial psychology-grounded evaluation.

The taxonomy reveals neighboring research directions that contextualize this contribution. The parent branch includes human-centric video emotion assessment, cross-lingual bimodal emotion recognition, and appraisal theory-informed prediction—all addressing emotion understanding but without hallucination-focused evaluation frameworks. A sibling top-level branch covers general hallucination mitigation through representation alignment, preference optimization, and causal debiasing. EmotionHallucer bridges these domains by applying hallucination detection principles specifically to affective content, diverging from both generic multimodal benchmarks and pure emotion recognition systems that lack adversarial robustness testing.

Among thirty candidates examined across three contributions, none yielded refutable prior work. The EmotionHallucer benchmark itself (ten candidates examined, zero refutations) appears novel within this search scope, as does the comprehensive evaluation of forty-one models and the PEP-MEK mitigation framework (each ten candidates, zero refutations). This absence of overlapping work aligns with the taxonomy structure showing no sibling papers in the same leaf. However, the limited search scale means unexplored literature beyond top-thirty semantic matches could contain relevant emotion-hallucination studies not captured here.

Based on the examined candidates and taxonomy position, the work occupies a sparse research direction at the intersection of emotion understanding and hallucination evaluation. The lack of sibling papers and zero refutations across contributions suggest substantive novelty within the scope analyzed. Nonetheless, the thirty-candidate search represents a bounded exploration—broader surveys of affective computing or multimodal robustness literature might reveal adjacent efforts not surfaced by semantic similarity alone.

Taxonomy

Core-task Taxonomy Papers: 8
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 0

Research Landscape Overview

Core task: evaluating emotion hallucinations in multimodal large language models. The field structure reflects three main branches that together address how MLLMs perceive and reason about emotions while managing the risk of generating unfaithful outputs. The first branch, Hallucination Detection and Mitigation in MLLMs, encompasses general techniques for identifying and reducing hallucinations across modalities, including contrastive learning approaches like Hallucination Augmented Contrastive[1] and preference-based methods such as Hallucination-Aware DPO[2]. The second branch, Emotion Understanding and Hallucination Evaluation, focuses specifically on how models interpret affective states and whether their emotion-related outputs align with ground truth, drawing on works like Cross-Lingual Bimodal Emotion[3] and debiasing strategies exemplified by Counterfactual Debiasing MLLMs[4]. The third branch, Emotion-Informed Downstream Applications, explores how emotion recognition feeds into practical tasks, with systems like ETICD-Net[5] and Image Analyzer[6] leveraging affective cues for richer multimodal understanding.

A particularly active line of work examines the intersection of hallucination mitigation and emotion-specific evaluation, where researchers ask whether standard hallucination metrics capture the nuances of affective misattribution. EmotionHallucer[0] sits squarely within the Emotion Understanding and Hallucination Evaluation branch, specifically targeting emotion hallucination benchmarking. Unlike broader hallucination studies that treat all modalities uniformly, EmotionHallucer[0] zeroes in on the unique challenges of emotion perception, where subjective interpretation and cultural context can blur the line between genuine error and plausible variation. This contrasts with works like HumanVBench[8], which may assess general human-centric attributes, and complements theoretical frameworks such as Appraisal Theory Emotion[7] by grounding abstract emotion models in concrete evaluation protocols.

The main open question remains how to balance fine-grained emotion taxonomies with scalable, reproducible benchmarking across diverse multimodal architectures.

Claimed Contributions

EmotionHallucer benchmark for emotion hallucination evaluation

The authors present EmotionHallucer, the first dedicated benchmark to evaluate emotion-related hallucinations in multimodal large language models. It assesses hallucinations from two perspectives: emotion psychology knowledge (factuality hallucination) and real-world multimodal perception (faithfulness hallucination), using an adversarial binary question-answer framework across seven subcategories and four modalities.

Retrieved candidate papers: 10

Comprehensive evaluation of 41 LLMs and MLLMs with three key findings

The authors evaluate 41 large language models and multimodal large language models on EmotionHallucer, revealing three main findings: most models exhibit substantial emotion hallucination issues; closed-source models outperform open-source ones, with reasoning capability providing an additional advantage; and models perform better on emotion psychology knowledge than on multimodal emotion perception (an aggregation sketch follows below).

Retrieved candidate papers: 10
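
As an illustration of how finding (3) could be aggregated, the sketch below compares accuracy on emotion-psychology-knowledge items against multimodal-perception items. The record schema (the `perspective` and `correct` fields) is hypothetical and not the benchmark's actual data format.

```python
# Hedged sketch: per-perspective accuracy, mirroring the comparison between
# emotion psychology knowledge and multimodal emotion perception. The field
# names below are illustrative assumptions.
from collections import defaultdict

def accuracy_by_perspective(results):
    """results: iterable of dicts with 'perspective' in
    {'knowledge', 'perception'} and a boolean 'correct' flag."""
    totals, hits = defaultdict(int), defaultdict(int)
    for r in results:
        totals[r["perspective"]] += 1
        hits[r["perspective"]] += int(r["correct"])
    return {p: hits[p] / totals[p] for p in totals}

# Illustrative usage with made-up records (not real benchmark results):
demo = [
    {"perspective": "knowledge", "correct": True},
    {"perspective": "knowledge", "correct": True},
    {"perspective": "perception", "correct": True},
    {"perspective": "perception", "correct": False},
]
print(accuracy_by_perspective(demo))  # {'knowledge': 1.0, 'perception': 0.5}
```
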
PEP-MEK framework for mitigating emotion hallucinations

The authors propose the Predict-Explain-Predict with Modality and Emotion Knowledge (PEP-MEK) framework, a plug-and-play method that incorporates modality-specific and emotional knowledge to reduce emotion hallucinations. Experiments show an average improvement of 9.90% in emotion hallucination detection across the selected models; a sketch of the loop follows below.

Retrieved candidate papers: 10
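
The sketch below illustrates one plausible reading of a Predict-Explain-Predict loop with injected modality and emotion knowledge, in the spirit of PEP-MEK. The prompt wording, the `chat` callable, and the two knowledge notes are assumptions for illustration, not the authors' exact implementation.

```python
# Illustrative Predict-Explain-Predict loop (assumed structure, not the
# authors' exact prompts): answer, self-check against injected knowledge,
# then re-answer with the explanation in context.
from typing import Callable

def pep_mek_answer(chat: Callable[[str], str], question: str,
                   modality_note: str, emotion_note: str) -> str:
    # Step 1: initial prediction on the bare question.
    first = chat(f"{question}\nAnswer yes or no.")
    # Step 2: explanation conditioned on modality-specific and
    # emotion-psychology knowledge supplied by the caller.
    explanation = chat(
        f"Question: {question}\n"
        f"Initial answer: {first}\n"
        f"Modality knowledge: {modality_note}\n"
        f"Emotion knowledge: {emotion_note}\n"
        "Explain step by step whether the initial answer is consistent "
        "with this knowledge."
    )
    # Step 3: final prediction conditioned on the self-generated explanation.
    final = chat(
        f"Question: {question}\nExplanation: {explanation}\n"
        "Give your final answer: yes or no."
    )
    return final.strip().lower()
```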

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Within the taxonomy built over the current top-K core-task papers, the original paper is assigned to a leaf with no direct siblings and no cousin branches under the same grandparent topic. In this retrieved landscape it appears structurally isolated, a partial signal of novelty that remains constrained by search coverage and taxonomy granularity.

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

EmotionHallucer benchmark for emotion hallucination evaluation

Ten candidate papers were retrieved for comparison against this benchmark claim; none refuted its novelty within the search scope (see the summary under Claimed Contributions above).

Contribution

Comprehensive evaluation of 41 LLMs and MLLMs with three key findings

Ten candidate papers were retrieved for comparison against this evaluation claim; none refuted it within the search scope.

Contribution

PEP-MEK framework for mitigating emotion hallucinations

Ten candidate papers were retrieved for comparison against the PEP-MEK claim; none refuted its novelty within the search scope.