AVERE: Improving Audiovisual Emotion Reasoning with Preference Optimization
Overview
Overall Novelty Assessment
The paper introduces EmoReAlM, a benchmark for evaluating spurious cue–emotion associations and hallucinations in multimodal large language models, alongside AVEm-DPO, a preference optimization technique that aligns model responses with audiovisual inputs and emotion-centric queries. Within the taxonomy, it resides in the Preference Optimization and Alignment leaf under Training Strategies and Optimization. This leaf contains only two papers total, indicating a relatively sparse research direction compared to more crowded areas like General Fusion Architectures or Comprehensive Emotion Understanding Benchmarks, which house four or more sibling works.
The taxonomy reveals that neighboring leaves, namely Instruction Tuning and Prompting, Knowledge Distillation and Transfer, and Contrastive and Representation Learning, collectively address training methodologies but do not explicitly target preference-based alignment for emotion reasoning. The Modality Conflict and Alignment Handling branch under Frameworks and Architectures handles cross-modal inconsistencies at the architectural level, whereas this work tackles alignment through optimization objectives. The scope_note for Preference Optimization emphasizes methods using DPO or related alignment techniques, distinguishing the leaf from instruction tuning without preference mechanisms; this positioning makes clear that the paper's dual focus on a benchmark and an optimization method sits at the intersection of evaluation and training innovation.
Of the seventeen candidates examined in total, three were compared against the EmoReAlM benchmark contribution, with one judged refutable, suggesting some overlap with existing evaluation frameworks. The AVEm-DPO technique was compared against ten candidates with one refutable match, indicating that while preference optimization for emotion reasoning has been explored, the specific audiovisual alignment strategy may offer incremental novelty. The text-prior debiasing regularization was compared against four candidates with two refutable matches, pointing to more substantial prior work on mitigating language model biases. Given the limited search scope of seventeen candidates rather than hundreds, these statistics reflect a snapshot rather than exhaustive coverage, and the relatively low refutation counts suggest the contributions occupy a less saturated niche within the broader emotion MLLM landscape.
Overall, the work appears to address a genuine gap in preference-based alignment for audiovisual emotion reasoning, particularly given the sparse Preference Optimization leaf. However, the benchmark and debiasing components show moderate overlap with existing evaluation and bias-mitigation efforts based on the limited candidate pool. The analysis covers top-K semantic matches and does not claim comprehensive field coverage, so additional related work may exist beyond the seventeen candidates examined.
Taxonomy
Research Landscape Overview
Claimed Contributions
The authors present EmoReAlM, a benchmark of 4,000 human-verified multiple-choice question-answer samples that systematically evaluates multimodal large language models along four axes: emotion reasoning, modality agreement, spurious audiovisual cue associations, and emotion-related hallucinations.
The authors introduce AVEm-DPO, a direct preference optimization method that constructs preference pairs at two levels: over responses, contrasting faithful answers with those exhibiting spurious associations or hallucinations, and over audiovisual input pairs guided by textual prompts. A regularization term further reduces the model's reliance on text priors.
The authors develop a text-prior debiasing mechanism that penalizes the policy reward for responses generated from text-only inputs, reducing hallucinations caused by language model biases that associate commonly co-occurring cues with specific emotions.
Core Task Comparisons
Comparisons with papers in the same taxonomy category
[48] Multimodal Video Emotion Recognition with Reliable Reasoning Priors
Contribution Analysis
Detailed comparisons for each claimed contribution
EmoReAlM benchmark for evaluating emotion reasoning and hallucinations in MLLMs
The authors present EmoReAlM, a benchmark of 4,000 human-verified multiple-choice question-answer samples that systematically evaluates multimodal large language models along four axes: emotion reasoning, modality agreement, spurious audiovisual cue associations, and emotion-related hallucinations.
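The item schema is not reproduced in this report; purely to make the four evaluation axes concrete, a hypothetical EmoReAlM-style sample probing a spurious audio cue might look like the Python sketch below, with every field name and value invented for illustration.

    # Hypothetical multiple-choice item targeting a spurious cue-emotion
    # association; the schema is illustrative, not the benchmark's actual
    # format.
    sample = {
        "clip_id": "clip_0421",
        "question": "Considering both the audio and the video, which "
                    "emotion does the speaker express?",
        "options": ["A. Happiness", "B. Sadness", "C. Anger", "D. Neutral"],
        "answer": "D",
        "probe": "spurious_audio_cue",
        # Background laughter is the spurious cue: a model that equates
        # laughter with happiness would answer "A" despite the speaker's
        # flat facial expression and prosody.
    }

Items probing modality agreement or emotion hallucination would vary the probe type and distractor design in the same spirit.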
[59] EmotionHallucer: Evaluating Emotion Hallucinations in Multimodal Large Language Models
[11] Emotion-LLaMA: Multimodal Emotion Recognition and Reasoning with Instruction Tuning
[60] Hallucination Is All You Need: Using Generative Models for Test Time Data Augmentation
AVEm-DPO preference optimization technique for audiovisual emotion reasoning
The authors introduce AVEm-DPO, a direct preference optimization method that constructs preference pairs at two levels: over responses, contrasting faithful answers with those exhibiting spurious associations or hallucinations, and over audiovisual input pairs guided by textual prompts. A regularization term further reduces the model's reliance on text priors.
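To make the two preference levels concrete, the following is a minimal sketch in standard DPO notation, assuming a policy pi_theta, a frozen reference model pi_ref, a temperature beta, the logistic function sigma, and an input x = (a, v, t) bundling audio, video, and the text prompt; the paper's exact construction of the pairs may differ.

    % Response-level preference: y_w is a response faithful to the
    % audiovisual evidence, y_l one containing a spurious association
    % or hallucination.
    \mathcal{L}_{\mathrm{resp}} = -\,\mathbb{E}_{(x,\,y_w,\,y_l)}\!\left[
      \log \sigma\!\left(
        \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
        - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
      \right)\right]

    % Input-level preference: the same response y is scored against the
    % matched audiovisual input x_w and a mismatched or degraded input
    % x_l, forcing the model to ground its answer in the actual audio
    % and video rather than in the prompt.
    \mathcal{L}_{\mathrm{input}} = -\,\mathbb{E}_{(x_w,\,x_l,\,y)}\!\left[
      \log \sigma\!\left(
        \beta \log \frac{\pi_\theta(y \mid x_w)}{\pi_{\mathrm{ref}}(y \mid x_w)}
        - \beta \log \frac{\pi_\theta(y \mid x_l)}{\pi_{\mathrm{ref}}(y \mid x_l)}
      \right)\right]

Under this reading, the full objective would combine both terms with the text-prior regularizer described under the third contribution below.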
[58] OmniDPO: A Preference Optimization Framework to Address Omni-Modal Hallucination
[10] Multimodal Emotion Cause Pair Extraction in Conversations Using Knowledge Distillation and Large Language Models
[12] R1-Omni: Explainable Omni-Multimodal Emotion Recognition with Reinforcement Learning
[51] A Review of Key Technologies for Emotion Analysis Using Multimodal Information
[52] Empathetic Response in Audio-Visual Conversations Using Emotion Preference Optimization and MambaCompressor
[53] Cat+: Investigating and Enhancing Audio-Visual Understanding in Large Language Models
[54] AlignCap: Aligning Speech Emotion Captioning to Human Preferences
[55] Enhancing Audiovisual Speech Recognition Through Bifocal Preference Optimization
[56] Audio-Visual Emotion Recognition with Preference Learning Based on Intended and Multi-Modal Perceived Labels
[57] Context-Aware Audio-Visual Speech Enhancement Based on Neuro-Fuzzy Modelling and User Preference Learning
Text-prior debiasing regularization to mitigate modality-specific hallucinations
The authors develop a text-prior debiasing mechanism that penalizes the policy reward for responses generated from text-only inputs, reducing hallucinations caused by language model biases that associate commonly co-occurring cues with specific emotions.
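Writing the implicit DPO reward as r_theta(x, y) = beta log(pi_theta(y|x)/pi_ref(y|x)), one plausible form of this penalty, sketched here as an assumption rather than the paper's exact formulation, evaluates the preferred response y_w with the audio and video dropped and pushes that text-only reward down:

    % Text-prior debiasing: t is the text prompt alone. Minimizing this
    % term lowers the reward the policy earns for y_w without the
    % audiovisual input, discouraging answers reproducible from language
    % priors alone.
    \mathcal{L}_{\mathrm{debias}} = -\,\mathbb{E}_{(t,\,y_w)}\!\left[
      \log \sigma\!\big( -\, r_\theta(t,\, y_w) \big)\right],
    \qquad
    r_\theta(t, y_w) = \beta \log \frac{\pi_\theta(y_w \mid t)}{\pi_{\mathrm{ref}}(y_w \mid t)}

A response that stays highly likely under the text-only condition is one the model could have produced without consulting the audiovisual evidence, which is exactly the co-occurrence bias this regularizer targets.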