EmotionHallucer: Evaluating Emotion Hallucinations in Multimodal Large Language Models
Overview
Overall Novelty Assessment
The paper introduces EmotionHallucer, the first dedicated benchmark for detecting emotion-related hallucinations in multimodal large language models. Within the taxonomy, it occupies the 'Emotion Hallucination Benchmarking' leaf under 'Emotion Understanding and Hallucination Evaluation'. Notably, this leaf contains only the original paper itself; no sibling papers appear in the same category. This isolation suggests the work addresses a previously uncharted niche: while general hallucination detection methods exist in parallel branches, none specifically targets emotion misattribution through adversarial, psychology-grounded evaluation.
The taxonomy reveals neighboring research directions that contextualize this contribution. The parent branch includes human-centric video emotion assessment, cross-lingual bimodal emotion recognition, and appraisal-theory-informed prediction, all of which address emotion understanding but lack hallucination-focused evaluation frameworks. A sibling top-level branch covers general hallucination mitigation through representation alignment, preference optimization, and causal debiasing. EmotionHallucer bridges these domains by applying hallucination detection principles specifically to affective content, diverging both from generic multimodal benchmarks and from pure emotion recognition systems that lack adversarial robustness testing.
Across the three claimed contributions, thirty candidate papers were examined (ten per contribution), and none refuted the claimed novelty. The EmotionHallucer benchmark itself appears novel within this search scope, as do the comprehensive evaluation of forty-one models and the PEP-MEK mitigation framework. This absence of overlapping work is consistent with the taxonomy structure, which shows no sibling papers in the same leaf. However, because the search was limited to the top thirty semantic matches, unexplored literature could still contain relevant emotion-hallucination studies not captured here.
Based on the examined candidates and the taxonomy position, the work occupies a sparse research direction at the intersection of emotion understanding and hallucination evaluation. The absence of sibling papers and of refutations across all three contributions suggests substantive novelty within the analyzed scope. Nonetheless, a thirty-candidate search is a bounded exploration; broader surveys of the affective computing or multimodal robustness literature might reveal adjacent efforts that semantic similarity alone did not surface.
Taxonomy
Research Landscape Overview
Claimed Contributions
The authors present EmotionHallucer, the first dedicated benchmark to evaluate emotion-related hallucinations in multimodal large language models. It assesses hallucinations from two perspectives: emotion psychology knowledge (factuality hallucination) and real-world multimodal perception (faithfulness hallucination), using an adversarial binary question-answer framework across seven subcategories and four modalities.
The authors evaluate 41 large language models and multimodal large language models on EmotionHallucer, revealing three main findings: most models exhibit substantial emotion hallucination issues; closed-source models outperform open-source ones, with reasoning capability providing a further advantage; and models handle emotion psychology knowledge better than multimodal emotion perception.
The authors propose the Predict-Explain-Predict with Modality and Emotion Knowledge (PEP-MEK) framework, a plug-and-play method that incorporates modality-specific and emotional knowledge to reduce emotion hallucinations. Experiments show an average improvement of 9.90% in emotion hallucination detection across selected models.
Core Task Comparisons
Comparisons with papers in the same taxonomy category
Contribution Analysis
Detailed comparisons for each claimed contribution
EmotionHallucer benchmark for emotion hallucination evaluation
The authors present EmotionHallucer, the first dedicated benchmark to evaluate emotion-related hallucinations in multimodal large language models. It assesses hallucinations from two perspectives: emotion psychology knowledge (factuality hallucination) and real-world multimodal perception (faithfulness hallucination), using an adversarial binary question-answer framework across seven subcategories and four modalities.
[1] Hallucination Augmented Contrastive Learning for Multimodal Large Language Model
[10] A Survey on Hallucination in Large Vision-Language Models
[12] MIHBench: Benchmarking and Mitigating Multi-Image Hallucinations in Multimodal Large Language Models
[19] MM-SafetyBench: A Benchmark for Safety Evaluation of Multimodal Large Language Models
[20] Visual Hallucinations of Multi-Modal Large Language Models
[21] Hallucination of Multimodal Large Language Models: A Survey
[22] HallusionBench: An Advanced Diagnostic Suite for Entangled Language Hallucination and Visual Illusion in Large Vision-Language Models
[23] Retrieve-then-Compare Mitigates Visual Hallucination in Multi-Modal Large Language Models
[24] Evaluating Object Hallucination in Large Vision-Language Models
[25] Unified Hallucination Detection for Multimodal Large Language Models
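The adversarial binary question-answer framework described above can be sketched in a few lines. This is a hedged illustration, not the paper's implementation: the pairing of one factual question with one adversarially altered counterpart follows the benchmark's description, but the data fields, answer normalization, and the `ask_model` stub are assumptions introduced for illustration.

```python
# Minimal sketch of an adversarial binary QA probe for emotion hallucination.
# A model is scored correct on a pair only if it affirms the true statement
# AND rejects the hallucinated variant, so a "yes"-biased model cannot pass.

def normalize(answer: str) -> str:
    """Map a free-form model reply to 'yes'/'no' (rough heuristic)."""
    return "yes" if answer.strip().lower().startswith("yes") else "no"

def evaluate_pair(ask_model, basic, hallucinated):
    """Pair-level correctness: both the basic and adversarial questions
    must be answered with their gold labels."""
    ok_basic = normalize(ask_model(basic["question"])) == basic["answer"]
    ok_hallu = normalize(ask_model(hallucinated["question"])) == hallucinated["answer"]
    return ok_basic and ok_hallu

# Toy pair with a rule-based stand-in for a real (M)LLM.
pair = (
    {"question": "Is fear typically accompanied by increased heart rate?", "answer": "yes"},
    {"question": "Is fear typically accompanied by decreased heart rate?", "answer": "no"},
)
fake_model = lambda q: "No." if "decreased" in q else "Yes."
print(evaluate_pair(fake_model, *pair))  # pair-level hit for this toy model
```

The pair-level criterion is what makes the setup adversarial: answering "yes" to everything scores 50% per question but 0% per pair.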
Comprehensive evaluation of 41 LLMs and MLLMs with three key findings
The authors evaluate 41 large language models and multimodal large language models on EmotionHallucer, revealing three main findings: most models exhibit substantial emotion hallucination issues; closed-source models outperform open-source ones, with reasoning capability providing a further advantage; and models handle emotion psychology knowledge better than multimodal emotion perception.
[26] LMMs-Eval: Reality Check on the Evaluation of Large Multimodal Models
[27] A Survey on Evaluation of Large Language Models
[28] Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-Modal LLMs in Video Analysis
[29] A Systematic Evaluation of Large Language Models of Code
[30] Testing and Evaluation of Health Care Applications of Large Language Models: A Systematic Review
[31] ChatGPT Beyond English: Towards a Comprehensive Evaluation of Large Language Models in Multilingual Learning
[32] A Survey on Multimodal Large Language Models for Autonomous Driving
[33] A Survey on Multimodal Large Language Models
[34] MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models
[35] SEED-Bench: Benchmarking Multimodal Large Language Models
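Findings such as "most models exhibit substantial emotion hallucination issues" rest on simple aggregate statistics over the binary answers. The sketch below shows the kind of summary a binary hallucination benchmark typically reports; the metric names (accuracy and "yes"-rate, a common proxy for affirmation bias) are assumptions, since the paper's exact reporting is not reproduced here.

```python
# Hedged sketch: aggregating per-question binary results into summary stats.
# A high yes_rate alongside middling accuracy is the classic signature of a
# model that hallucinates by agreeing with adversarial premises.

def summarize(results):
    """results: list of (predicted, gold) 'yes'/'no' label pairs."""
    n = len(results)
    acc = sum(p == g for p, g in results) / n
    yes_rate = sum(p == "yes" for p, _ in results) / n  # tendency to affirm
    return {"accuracy": round(acc, 3), "yes_rate": round(yes_rate, 3)}

toy = [("yes", "yes"), ("yes", "no"), ("no", "no"), ("yes", "yes")]
print(summarize(toy))  # {'accuracy': 0.75, 'yes_rate': 0.75}
```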
PEP-MEK framework for mitigating emotion hallucinations
The authors propose the Predict-Explain-Predict with Modality and Emotion Knowledge (PEP-MEK) framework, a plug-and-play method that incorporates modality-specific and emotional knowledge to reduce emotion hallucinations. Experiments show an average improvement of 9.90% in emotion hallucination detection across selected models.
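The predict-explain-predict loop can be sketched as a three-turn prompting wrapper around any model. This is a hypothetical illustration in the spirit of PEP-MEK, not the authors' implementation: the prompt wording, the `ask_model` interface, and the toy stand-in model are all assumptions; only the overall structure (initial prediction, knowledge-grounded explanation, revised prediction) follows the paper's description.

```python
# Hypothetical Predict-Explain-Predict sketch: the model answers, explains
# its answer using injected modality-specific and emotion knowledge, then
# re-answers with that explanation in context. Plug-and-play: any callable
# taking a prompt and returning a string can serve as `ask_model`.

def pep_mek(ask_model, question, modality_hint, emotion_hint):
    first = ask_model(question)  # step 1: initial prediction
    explanation = ask_model(     # step 2: knowledge-grounded explanation
        f"{question}\nYour initial answer was: {first}.\n"
        f"Considering the {modality_hint} evidence and the fact that "
        f"{emotion_hint}, explain step by step whether that answer holds."
    )
    return ask_model(            # step 3: revised prediction
        f"{question}\nRelevant analysis: {explanation}\n"
        "Answer strictly 'yes' or 'no'."
    )

# Toy stand-in model that revises its answer once the explanation surfaces
# a conflict between modalities.
def toy_model(prompt):
    if "Answer strictly" in prompt:
        return "no" if "conflict" in prompt else "yes"
    if "explain step by step" in prompt:
        return "The audio tone and facial cues conflict with the claimed emotion."
    return "yes"

print(pep_mek(toy_model, "Is the speaker elated?", "audio-visual",
              "elation usually co-occurs with raised pitch"))  # prints: no
```

The toy model illustrates the intended mechanism: the intermediate explanation injects evidence the final prediction can act on, which is how such a wrapper could flip an initial hallucinated "yes" without any fine-tuning.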