AVERE: Improving Audiovisual Emotion Reasoning with Preference Optimization

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: MLLM, Emotion Recognition, Multimodal Reasoning
Abstract:

Emotion understanding is essential for building socially intelligent agents. Although recent multimodal large language models (MLLMs) have shown strong performance on this task, two key challenges remain: (i) spurious associations between emotions and irrelevant audiovisual cues and (ii) hallucination of audiovisual cues driven by text priors in the language model backbone. To quantify and understand these issues, we introduce EmoReAlM, a benchmark designed to evaluate MLLMs for cue–emotion associations, hallucinations, and modality agreement. We then propose AVEm-DPO, a preference optimization technique that aligns model responses with both audiovisual inputs and emotion-centric queries. Specifically, we construct preferences over (i) responses exhibiting spurious associations or hallucinations and (ii) audiovisual input pairs guided by textual prompts. We also include a regularization term that penalizes reliance on text priors, thereby mitigating modality-specific cue hallucinations. Experimental results on DFEW, RAVDESS, and EMER demonstrate that our method significantly improves the performance of the reference baseline models (a 6-19% relative improvement) in zero-shot settings. By providing both a rigorous benchmark and a robust optimization framework, this work enables principled evaluation and improvement of MLLMs for emotion understanding and social AI.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces EmoReAlM, a benchmark for evaluating spurious cue–emotion associations and hallucinations in multimodal large language models, alongside AVEm-DPO, a preference optimization technique that aligns model responses with audiovisual inputs and emotion-centric queries. Within the taxonomy, it resides in the Preference Optimization and Alignment leaf under Training Strategies and Optimization. This leaf contains only two papers total, indicating a relatively sparse research direction compared to more crowded areas like General Fusion Architectures or Comprehensive Emotion Understanding Benchmarks, which house four or more sibling works.

The taxonomy reveals that neighboring leaves—Instruction Tuning and Prompting, Knowledge Distillation and Transfer, and Contrastive and Representation Learning—collectively address training methodologies but do not explicitly target preference-based alignment for emotion reasoning. The Modality Conflict and Alignment Handling branch under Frameworks and Architectures focuses on cross-modal inconsistencies at the architectural level, whereas this work tackles alignment through optimization objectives. The scope_note for Preference Optimization emphasizes methods using DPO or alignment techniques, distinguishing it from instruction tuning without preference mechanisms, which clarifies that the paper's dual focus on benchmark and optimization sits at the intersection of evaluation and training innovation.

Among the seventeen candidates examined in total, the EmoReAlM benchmark contribution shows one refutable candidate out of three examined, suggesting some overlap with existing evaluation frameworks. For the AVEm-DPO technique, ten candidates were examined with one refutable match, indicating that while preference optimization for emotion reasoning has been explored, the specific audiovisual alignment strategy may offer incremental novelty. For the text-prior debiasing regularization, four candidates were examined with two refutable matches, pointing to more substantial prior work on mitigating language model biases. Given the limited search scope (seventeen candidates total, not hundreds), these statistics reflect a snapshot rather than exhaustive coverage, and the relatively low refutation counts suggest the contributions occupy a less saturated niche within the broader emotion MLLM landscape.

Overall, the work appears to address a genuine gap in preference-based alignment for audiovisual emotion reasoning, particularly given the sparse Preference Optimization leaf. However, the benchmark and debiasing components show moderate overlap with existing evaluation and bias-mitigation efforts based on the limited candidate pool. The analysis covers top-K semantic matches and does not claim comprehensive field coverage, so additional related work may exist beyond the seventeen candidates examined.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 17
Refutable Papers: 4

Research Landscape Overview

Core task: audiovisual emotion reasoning with multimodal large language models. The field has evolved around several complementary directions. Emotion Understanding Frameworks and Architectures explore foundational model designs that integrate visual, auditory, and textual cues for affective interpretation, with works like Affectgpt[2] and Emollm[5] exemplifying early MLLM-based approaches. Training Strategies and Optimization address how to effectively align these models through preference optimization, instruction tuning, and domain adaptation, while Benchmarks and Evaluation Frameworks provide standardized testbeds such as MME-emotion[15] and OmniBench[35] to measure progress. Specialized Emotion Understanding Tasks focus on nuanced challenges like emotion cause extraction, conversational emotion recognition, and explainable reasoning, whereas Robustness and Noise Handling tackle real-world degradation scenarios. Application-Oriented Systems translate these capabilities into practical domains such as psychological counseling and lie detection, and Survey and Review Studies like MLLM Emotion Survey[6] synthesize emerging trends across the landscape.

Recent efforts reveal a tension between end-to-end multimodal reasoning and modular perception-then-reasoning pipelines. Many studies emphasize holistic audiovisual fusion within a single MLLM backbone, as seen in Omni-emotion[16] and EmoVerse[19], while others adopt staged architectures that first extract modality-specific features before feeding them to language models.

Within the Training Strategies and Optimization branch, AVERE[0] situates itself in the Preference Optimization and Alignment cluster, focusing on aligning model outputs with human affective judgments through techniques that refine reasoning consistency.
This contrasts with nearby works like Reliable Reasoning Priors[48], which emphasizes incorporating structured prior knowledge to improve inference reliability, and Emotion-Coherent Reasoning[49], which targets coherence across multi-turn interactions. AVERE[0] thus addresses a critical gap in ensuring that MLLMs not only recognize emotions but also reason about them in ways that align with nuanced human preferences, bridging perceptual accuracy and interpretive fidelity.

Claimed Contributions

EmoReAlM benchmark for evaluating emotion reasoning and hallucinations in MLLMs

The authors present EmoReAlM, a benchmark containing 4000 human-verified multiple-choice question-answer samples that systematically evaluates multimodal large language models on emotion reasoning, modality agreement, spurious audiovisual cue associations, and emotion-related hallucinations.

Retrieved papers: 3 (Can Refute)
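The evaluation axes listed above suggest a simple per-axis scoring scheme. The following is a minimal sketch of per-category multiple-choice accuracy; the field names (`question`, `options`, `answer`, `category`) and the `predict` interface are hypothetical, not taken from the benchmark release:

```python
from collections import defaultdict

def category_accuracy(samples, predict):
    """Per-category multiple-choice accuracy over benchmark samples.

    Each sample is a dict with hypothetical fields: 'question',
    'options', 'answer' (gold option index), and 'category'
    (e.g. 'modality_agreement' or 'hallucination').
    `predict` maps (question, options) to a chosen option index.
    """
    correct, total = defaultdict(int), defaultdict(int)
    for s in samples:
        total[s["category"]] += 1
        if predict(s["question"], s["options"]) == s["answer"]:
            correct[s["category"]] += 1
    return {c: correct[c] / total[c] for c in total}
```

Plugging an MLLM's option choice in as `predict` would yield one accuracy number per evaluation axis (emotion reasoning, modality agreement, spurious associations, hallucinations), which matches how the benchmark's dimensions are described.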
AVEm-DPO preference optimization technique for audiovisual emotion reasoning

The authors introduce AVEm-DPO, a direct preference optimization method that constructs preferences over responses with spurious associations or hallucinations and over audiovisual input pairs guided by textual prompts, while including a regularization term to reduce reliance on text priors.

Retrieved papers: 10 (Can Refute)
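For reference, the objective described above builds on the standard DPO loss (negative log-sigmoid of the reference-relative reward margin). A minimal sketch for one preference pair follows; the pairing of a faithful answer against a hallucinated one is our reading of the summary, and the full AVEm-DPO objective is not reproduced here:

```python
import math

def dpo_pair_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss: -log sigmoid(beta * reward margin).

    logp_w / logp_l: summed policy log-probs of the preferred and
    dispreferred responses (here, e.g., a faithful answer vs. one with
    a spurious cue-emotion association or hallucinated cue);
    ref_logp_*: the same quantities under the frozen reference model.
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # Numerically stable -log(sigmoid(margin)).
    return math.log1p(math.exp(-abs(margin))) + max(-margin, 0.0)
```

The loss equals log 2 when policy and reference agree, and shrinks as the policy's margin for the preferred response grows, which is the mechanism by which preference pairs over hallucinated responses steer the model.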
Text-prior debiasing regularization to mitigate modality-specific hallucinations

The authors develop a text-prior debiasing mechanism that penalizes the policy reward for responses generated from text-only inputs, reducing hallucinations caused by language model biases that associate commonly co-occurring cues with specific emotions.

Retrieved papers: 4 (Can Refute)
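One plausible instantiation of the penalty described above is to add the policy's text-only reward for the preferred response to the DPO loss, so that answers reproducible from the text prompt alone are discouraged. All names, the reference-relative form of the penalty, and the coefficient `lam` are our assumptions, not the paper's formulation:

```python
import math

def debiased_pair_loss(logp_w, logp_l, ref_logp_w, ref_logp_l,
                       text_logp_w, text_ref_logp_w,
                       beta=0.1, lam=0.5):
    """DPO loss plus a text-prior penalty (illustrative sketch).

    text_logp_w / text_ref_logp_w: log-probs of the preferred response
    when the audiovisual input is withheld and only the text prompt is
    shown. The penalty grows when the policy can reproduce the answer
    from text alone, discouraging reliance on language-model priors.
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    dpo = math.log1p(math.exp(-abs(margin))) + max(-margin, 0.0)
    text_reward = beta * (text_logp_w - text_ref_logp_w)
    return dpo + lam * text_reward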

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

EmoReAlM benchmark for evaluating emotion reasoning and hallucinations in MLLMs

The authors present EmoReAlM, a benchmark containing 4000 human-verified multiple-choice question-answer samples that systematically evaluates multimodal large language models on emotion reasoning, modality agreement, spurious audiovisual cue associations, and emotion-related hallucinations.

Contribution

AVEm-DPO preference optimization technique for audiovisual emotion reasoning

The authors introduce AVEm-DPO, a direct preference optimization method that constructs preferences over responses with spurious associations or hallucinations and over audiovisual input pairs guided by textual prompts, while including a regularization term to reduce reliance on text priors.

Contribution

Text-prior debiasing regularization to mitigate modality-specific hallucinations

The authors develop a text-prior debiasing mechanism that penalizes the policy reward for responses generated from text-only inputs, reducing hallucinations caused by language model biases that associate commonly co-occurring cues with specific emotions.