AVERE: Improving Audiovisual Emotion Reasoning with Preference Optimization

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: MLLM, Emotion Recognition, Multimodal Reasoning
Abstract:

Emotion understanding is essential for building socially intelligent agents. Although recent multimodal large language models (MLLMs) have shown strong performance on this task, two key challenges remain: (i) spurious associations between emotions and irrelevant audiovisual cues and (ii) hallucination of audiovisual cues driven by text priors in the language model backbone. To quantify and understand these issues, we introduce EmoReAlM, a benchmark designed to evaluate MLLMs for cue–emotion associations, hallucinations, and modality agreement. We then propose AVEm-DPO, a preference optimization technique that aligns model responses with both audiovisual inputs and emotion-centric queries. Specifically, we construct preferences over (i) responses exhibiting spurious associations or hallucinations and (ii) audiovisual input pairs guided by textual prompts. We also include a regularization term that penalizes reliance on text priors, thereby mitigating modality-specific cue hallucinations. Experimental results on DFEW, RAVDESS, and EMER demonstrate that our method significantly improves the performance of the reference baseline models (a 6-19% relative improvement) in zero-shot settings. By providing both a rigorous benchmark and a robust optimization framework, this work enables principled evaluation and improvement of MLLMs for emotion understanding and social AI.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces EmoReAlM, a benchmark for evaluating spurious cue–emotion associations and hallucinations in multimodal large language models, alongside AVEm-DPO, a preference optimization technique that aligns model responses with audiovisual inputs and emotion-centric queries. Within the taxonomy, it resides in the Preference Optimization and Alignment leaf under Training Strategies and Optimization. This leaf contains only two papers total, indicating a relatively sparse research direction compared to more crowded areas like General Fusion Architectures or Comprehensive Emotion Understanding Benchmarks, which house four or more sibling works.

The taxonomy reveals that neighboring leaves—Instruction Tuning and Prompting, Knowledge Distillation and Transfer, and Contrastive and Representation Learning—collectively address training methodologies but do not explicitly target preference-based alignment for emotion reasoning. The Modality Conflict and Alignment Handling branch under Frameworks and Architectures focuses on cross-modal inconsistencies at the architectural level, whereas this work tackles alignment through optimization objectives. The scope_note for Preference Optimization emphasizes methods using DPO or alignment techniques, distinguishing it from instruction tuning without preference mechanisms, which clarifies that the paper's dual focus on benchmark and optimization sits at the intersection of evaluation and training innovation.

Among the seventeen candidates examined in total, the EmoReAlM benchmark contribution shows one refutable candidate out of three examined, suggesting some overlap with existing evaluation frameworks. For the AVEm-DPO technique, ten candidates were examined with one refutable match, indicating that while preference optimization for emotion reasoning has been explored, the specific audiovisual alignment strategy may offer incremental novelty. For the text-prior debiasing regularization, four candidates were examined with two refutable matches, pointing to more substantial prior work on mitigating language model biases. Given the limited search scope (seventeen candidates total, not hundreds), these statistics reflect a snapshot rather than exhaustive coverage, and the relatively low refutation counts suggest the contributions occupy a less saturated niche within the broader emotion MLLM landscape.

Overall, the work appears to address a genuine gap in preference-based alignment for audiovisual emotion reasoning, particularly given the sparse Preference Optimization leaf. However, the benchmark and debiasing components show moderate overlap with existing evaluation and bias-mitigation efforts based on the limited candidate pool. The analysis covers top-K semantic matches and does not claim comprehensive field coverage, so additional related work may exist beyond the seventeen candidates examined.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 17
Refutable Papers: 4

Research Landscape Overview

Core task: audiovisual emotion reasoning with multimodal large language models. The field has evolved around several complementary directions. Emotion Understanding Frameworks and Architectures explore foundational model designs that integrate visual, auditory, and textual cues for affective interpretation, with works like Affectgpt[2] and Emollm[5] exemplifying early MLLM-based approaches. Training Strategies and Optimization address how to effectively align these models through preference optimization, instruction tuning, and domain adaptation, while Benchmarks and Evaluation Frameworks provide standardized testbeds such as MME-emotion[15] and OmniBench[35] to measure progress. Specialized Emotion Understanding Tasks focus on nuanced challenges like emotion cause extraction, conversational emotion recognition, and explainable reasoning, whereas Robustness and Noise Handling tackle real-world degradation scenarios. Application-Oriented Systems translate these capabilities into practical domains such as psychological counseling and lie detection, and Survey and Review Studies like MLLM Emotion Survey[6] synthesize emerging trends across the landscape.

Recent efforts reveal a tension between end-to-end multimodal reasoning and modular perception-then-reasoning pipelines. Many studies emphasize holistic audiovisual fusion within a single MLLM backbone, as seen in Omni-emotion[16] and EmoVerse[19], while others adopt staged architectures that first extract modality-specific features before feeding them to language models.

Within the Training Strategies and Optimization branch, AVERE[0] situates itself in the Preference Optimization and Alignment cluster, focusing on aligning model outputs with human affective judgments through techniques that refine reasoning consistency.
This contrasts with nearby works like Reliable Reasoning Priors[48], which emphasizes incorporating structured prior knowledge to improve inference reliability, and Emotion-Coherent Reasoning[49], which targets coherence across multi-turn interactions. AVERE[0] thus addresses a critical gap in ensuring that MLLMs not only recognize emotions but also reason about them in ways that align with nuanced human preferences, bridging perceptual accuracy and interpretive fidelity.

Claimed Contributions

EmoReAlM benchmark for evaluating emotion reasoning and hallucinations in MLLMs

The authors present EmoReAlM, a benchmark containing 4000 human-verified multiple-choice question-answer samples that systematically evaluates multimodal large language models on emotion reasoning, modality agreement, spurious audiovisual cue associations, and emotion-related hallucinations.

Retrieved papers: 3 (Can Refute)
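The evaluation axes listed above suggest a simple per-axis scoring scheme. The following is a minimal sketch of per-category multiple-choice accuracy; the field names (`question`, `options`, `answer`, `category`) and the `predict` interface are hypothetical, not taken from the benchmark release:

```python
from collections import defaultdict

def category_accuracy(samples, predict):
    """Per-category multiple-choice accuracy over benchmark samples.

    Each sample is a dict with hypothetical fields: 'question',
    'options', 'answer' (gold option index), and 'category'
    (e.g. 'modality_agreement' or 'hallucination').
    `predict` maps (question, options) to a chosen option index.
    """
    correct, total = defaultdict(int), defaultdict(int)
    for s in samples:
        total[s["category"]] += 1
        if predict(s["question"], s["options"]) == s["answer"]:
            correct[s["category"]] += 1
    return {c: correct[c] / total[c] for c in total}
```

Plugging an MLLM's option choice in as `predict` would yield one accuracy number per evaluation axis (emotion reasoning, modality agreement, spurious associations, hallucinations), which matches how the benchmark's dimensions are described.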
AVEm-DPO preference optimization technique for audiovisual emotion reasoning

The authors introduce AVEm-DPO, a direct preference optimization method that constructs preferences over responses with spurious associations or hallucinations and over audiovisual input pairs guided by textual prompts, while including a regularization term to reduce reliance on text priors.

Retrieved papers: 10 (Can Refute)
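For reference, the objective described above builds on the standard DPO loss (negative log-sigmoid of the reference-relative reward margin). A minimal sketch for one preference pair follows; the pairing of a faithful answer against a hallucinated one is our reading of the summary, and the full AVEm-DPO objective is not reproduced here:

```python
import math

def dpo_pair_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss: -log sigmoid(beta * reward margin).

    logp_w / logp_l: summed policy log-probs of the preferred and
    dispreferred responses (here, e.g., a faithful answer vs. one with
    a spurious cue-emotion association or hallucinated cue);
    ref_logp_*: the same quantities under the frozen reference model.
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # Numerically stable -log(sigmoid(margin)).
    return math.log1p(math.exp(-abs(margin))) + max(-margin, 0.0)
```

The loss equals log 2 when policy and reference agree, and shrinks as the policy's margin for the preferred response grows, which is the mechanism by which preference pairs over hallucinated responses steer the model.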
Text-prior debiasing regularization to mitigate modality-specific hallucinations

The authors develop a text-prior debiasing mechanism that penalizes the policy reward for responses generated from text-only inputs, reducing hallucinations caused by language model biases that associate commonly co-occurring cues with specific emotions.

Retrieved papers: 4 (Can Refute)
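One plausible instantiation of the penalty described above is to add the policy's text-only reward for the preferred response to the DPO loss, so that answers reproducible from the text prompt alone are discouraged. All names, the reference-relative form of the penalty, and the coefficient `lam` are our assumptions, not the paper's formulation:

```python
import math

def debiased_pair_loss(logp_w, logp_l, ref_logp_w, ref_logp_l,
                       text_logp_w, text_ref_logp_w,
                       beta=0.1, lam=0.5):
    """DPO loss plus a text-prior penalty (illustrative sketch).

    text_logp_w / text_ref_logp_w: log-probs of the preferred response
    when the audiovisual input is withheld and only the text prompt is
    shown. The penalty grows when the policy can reproduce the answer
    from text alone, discouraging reliance on language-model priors.
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    dpo = math.log1p(math.exp(-abs(margin))) + max(-margin, 0.0)
    text_reward = beta * (text_logp_w - text_ref_logp_w)
    return dpo + lam * text_reward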

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

EmoReAlM benchmark for evaluating emotion reasoning and hallucinations in MLLMs

The authors present EmoReAlM, a benchmark containing 4000 human-verified multiple-choice question-answer samples that systematically evaluates multimodal large language models on emotion reasoning, modality agreement, spurious audiovisual cue associations, and emotion-related hallucinations.

Contribution

AVEm-DPO preference optimization technique for audiovisual emotion reasoning

The authors introduce AVEm-DPO, a direct preference optimization method that constructs preferences over responses with spurious associations or hallucinations and over audiovisual input pairs guided by textual prompts, while including a regularization term to reduce reliance on text priors.

Contribution

Text-prior debiasing regularization to mitigate modality-specific hallucinations

The authors develop a text-prior debiasing mechanism that penalizes the policy reward for responses generated from text-only inputs, reducing hallucinations caused by language model biases that associate commonly co-occurring cues with specific emotions.