Unveiling the Cognitive Compass: Theory-of-Mind–Guided Multimodal Emotion Reasoning

ICLR 2026 Conference Submission

Anonymous Authors

OpenReview Score: 6.0

Multimodal Affective Computing, Multimodal Understanding and Reasoning, Reinforcement Learning

Abstract:

Despite rapid progress in multimodal large language models (MLLMs), their capability for deep emotional understanding remains limited. We argue that genuine affective intelligence requires explicit modeling of Theory of Mind (ToM), the cognitive substrate from which emotions arise. To this end, we first introduce HitEmotion, a ToM-grounded hierarchical benchmark that diagnoses capability breakpoints across increasing levels of cognitive depth. Second, we propose a ToM-guided reasoning chain that tracks mental states and calibrates cross-modal evidence to achieve faithful emotional reasoning. Third, we introduce TMPO, a reinforcement learning method that uses intermediate mental states as process-level supervision to guide and strengthen model reasoning. Extensive experiments show that HitEmotion exposes deep emotional reasoning deficits in state-of-the-art models, especially on cognitively demanding tasks. In evaluation, the ToM-guided reasoning chain and TMPO improve end-task accuracy and yield more faithful, more coherent rationales. In conclusion, our work provides the research community with a practical toolkit for evaluating and enhancing the cognition-based emotional understanding capabilities of MLLMs.

Disclaimer

This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.

NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.

If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces HitEmotion, a hierarchical benchmark for diagnosing emotional reasoning capabilities at increasing cognitive depths, alongside a ToM-guided reasoning chain and TMPO reinforcement learning method. It resides in the 'Cognitive Architecture for ToM-Enhanced Emotion Processing' leaf, which contains only three papers total, including this one. This leaf sits within the broader 'Theory-of-Mind Reasoning Frameworks and Architectures' branch, indicating the work addresses architectural design rather than pure benchmarking or application deployment. The sparse population of this specific leaf suggests the integration of ToM principles into cognitive architectures for emotion processing remains an emerging research direction.

The taxonomy reveals neighboring leaves focused on Bayesian probabilistic reasoning and interpretability analysis, while sibling branches address benchmark development and application domains like strategic games and embodied agents. The paper's positioning bridges multiple concerns: it contributes both a benchmark (typically housed in the evaluation branch) and architectural innovations (reasoning chains, TMPO training). This cross-cutting nature distinguishes it from purely benchmark-focused efforts like MMToM-QA or purely application-driven work in negotiation scenarios. The scope notes clarify that this leaf excludes application-specific implementations and pure evaluation studies, positioning the work as foundational framework development with accompanying diagnostic tools.

Among the 22 candidates examined through semantic search, none clearly refutes any of the three contributions. For the HitEmotion benchmark, 2 candidates were examined with no refutations, suggesting limited prior work on hierarchical ToM-grounded emotion evaluation. For the ToM-guided reasoning chain and for the TMPO method, 10 candidates each were examined with no refutations, indicating these specific technical approaches appear novel within the search scope. However, this reflects a limited top-K semantic search rather than exhaustive coverage, so the absence of refutations should be read cautiously: it is evidence of novelty within the examined sample, not definitive proof of field-wide originality.

Given the sparse taxonomy leaf and absence of refutations among 22 examined candidates, the work appears to occupy relatively unexplored territory at the intersection of ToM cognitive architectures and emotion reasoning. The hierarchical benchmark structure and process-level supervision via mental states represent distinctive technical choices. However, the limited search scope means potentially relevant work in adjacent areas—such as emotion explanation systems or multi-agent ToM benchmarks—may not have been fully captured, and broader literature on reinforcement learning from intermediate reasoning steps could provide additional context for assessing the TMPO contribution's novelty.

Taxonomy

[Taxonomy figure not reproduced. Legend: core-task taxonomy papers; claimed contributions; contribution candidate papers compared; refutable paper.]

Research Landscape Overview

Core task: Theory-of-Mind guided multimodal emotion reasoning. This emerging field seeks to integrate cognitive models of mental state attribution with multimodal signal processing to enable richer emotion understanding. The taxonomy reveals four main branches: reasoning frameworks and architectures that design computational mechanisms for ToM-enhanced emotion processing; multimodal benchmarks and evaluation datasets that provide standardized testbeds for assessing ToM capabilities across vision, language, and other modalities; application domains spanning negotiation, human-computer interaction, and clinical settings; and cognitive and clinical perspectives that ground computational work in psychological theory. Representative efforts include MMToM-QA[7] and Muma ToM[2] for benchmark development, Multimind Werewolf[3] for strategic reasoning in social games, and Emotional Theory Mind[4] for affective state inference. These branches collectively illustrate how ToM reasoning can be operationalized through diverse modalities and task contexts.

Recent work highlights contrasts between symbolic cognitive architectures and data-driven learning approaches, as well as trade-offs between interpretability and scalability. Some studies emphasize Bayesian or probabilistic frameworks for belief modeling, such as Weak to Strong Bayesian[1] and Bayesian Planner ToM[14], while others leverage neural architectures for end-to-end multimodal fusion.

Cognitive Compass[0] sits within the cognitive architecture branch, proposing a structured framework for ToM-enhanced emotion processing that integrates multiple reasoning modules. Compared to nearby works like Consistency Uncertainty Detection[11], which focuses on uncertainty estimation in mental state inference, and Modeling ToM HCI[12], which targets interactive system design, Cognitive Compass[0] emphasizes a holistic cognitive architecture that coordinates perception, reasoning, and affective interpretation. This positioning reflects ongoing debates about whether ToM-guided emotion reasoning is best achieved through modular symbolic systems or through tightly integrated neural models.

Claimed Contributions

HitEmotion: ToM-grounded hierarchical benchmark for multimodal emotion understanding

2 retrieved papers

The authors present HitEmotion, a benchmark that systematically organizes 24 emotion-related tasks into three hierarchical levels (Emotion Perception and Recognition, Emotion Understanding and Analysis, and Emotion Cognition and Reasoning) grounded in Theory of Mind principles. This structure enables precise measurement of model capability breakpoints at different cognitive depths.
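
To make the idea of a "capability breakpoint" concrete, here is a minimal Python sketch of how per-level accuracy could be tallied and the first weak level located. Only the three level names come from the report; the record format, the function names `accuracy_by_level` and `first_breakpoint`, and the accuracy threshold are assumptions made for illustration and are not the authors' evaluation protocol.

```python
from collections import defaultdict

# The three level names are taken from the report, ordered by cognitive depth.
# Everything else in this sketch is hypothetical.
LEVELS = [
    "Emotion Perception and Recognition",
    "Emotion Understanding and Analysis",
    "Emotion Cognition and Reasoning",
]


def accuracy_by_level(records):
    """Compute accuracy per level from (level_name, is_correct) pairs."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for level, correct in records:
        totals[level] += 1
        hits[level] += int(correct)
    # Only report levels that actually have evaluated items.
    return {lvl: hits[lvl] / totals[lvl] for lvl in LEVELS if totals[lvl]}


def first_breakpoint(acc, threshold):
    """Return the shallowest level whose accuracy is below `threshold`.

    Returns None if every evaluated level meets the threshold.
    The threshold is an assumed, user-chosen value.
    """
    for lvl in LEVELS:
        if lvl in acc and acc[lvl] < threshold:
            return lvl
    return None
```

Under these assumptions, a model that is accurate at the first level but drops below the chosen threshold at the second would have its breakpoint reported at "Emotion Understanding and Analysis".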

ToM-guided reasoning chain for faithful emotional reasoning

10 retrieved papers

The authors develop a structured reasoning approach based on Theory of Mind that explicitly tracks mental states and integrates multimodal evidence. This method aims to shift models from superficial pattern matching to deeper mental state simulation for more faithful emotional understanding.

TMPO: Theory-of-Mind preference optimization method

10 retrieved papers

The authors propose TMPO, a novel reinforcement learning framework that leverages intermediate mental states from ToM-based reasoning chains as process-level supervision. This method combines supervised fine-tuning with group-wise reward policy optimization to transform reasoning from a general emergent ability into a domain-acquired skill.
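
The report does not give TMPO's reward or objective, so the following Python sketch only illustrates the general pattern the description suggests: an outcome reward combined with a process-level reward from intermediate mental-state steps, followed by group-wise normalization of rewards in the style of group-relative policy optimization. The function names, the `process_weight` parameter, and the step-matching signal are assumptions for illustration, not the authors' method.

```python
import statistics


def combined_reward(answer_correct, steps_matched, steps_total, process_weight=0.5):
    """Outcome reward plus a weighted process reward (hypothetical weighting).

    steps_matched / steps_total stands in for how many intermediate
    mental-state steps agree with a reference chain.
    """
    outcome = 1.0 if answer_correct else 0.0
    process = steps_matched / steps_total if steps_total else 0.0
    return outcome + process_weight * process


def group_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of sampled responses.

    Each advantage is (reward - group mean) / (group std + eps), the usual
    group-relative form; eps avoids division by zero for uniform groups.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]
```

In this sketch, a response with a wrong final answer but a partly correct mental-state chain receives a higher advantage than one that is wrong throughout, which is the kind of signal process-level supervision is meant to provide.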

Core Task Comparisons

Comparisons with papers in the same taxonomy category

[11] Consistency, uncertainty or inconsistency detection in multimodal emotion recognition

Alessia Fantini, Giovanni Pilato, Gianpaolo Vitale (2023)

[12] Modeling theory of mind in multimodal HCI

Yifan Zhu, Hannah VanderHoeven, Kenneth Lai, Mariah Bradford, Christopher Tam, Ibrahim Khebour, R. Brutti, Nikhil Krishnaswamy, Yi-Fan Zhu, James Pustejovsky, Richard Brutti (2024)

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

HitEmotion: ToM-grounded hierarchical benchmark for multimodal emotion understanding

[31] Directions for Computational Theory of Mind: Data, Metrics, Models

Cannot Refute

[32] Affective-CoT: Decomposing Multimodal Emotion Reasoning through a Hierarchical Cognitive Workflow

Cannot Refute

Contribution

ToM-guided reasoning chain for faithful emotional reasoning

[21] Multimodal mental state analysis

Cannot Refute

[22] Emotional Intelligence in Artificial Agents: Leveraging Deep Multimodal Big Data for Contextual Social Interaction and Adaptive Behavioral Modelling

Cannot Refute

[23] Inference-enabled tracking of acute mental stress via multi-modal wearable physiological sensing: A proof-of-concept study

Cannot Refute

[24] Multimodal temporal context network for tracking dynamic changes in emotion

Cannot Refute

[25] Integrating emotion dynamics in mental health: A trimodal framework combining ecological momentary assessment, physiological measurements, and speech …

Cannot Refute

[26] Husformer: A Multimodal Transformer for Multimodal Human State Recognition

Cannot Refute

[27] Multimodal large language models meet multimodal emotion recognition and reasoning: A survey

Cannot Refute

[28] From Coarse to Nuanced: Cross-Modal Alignment of Fine-Grained Linguistic Cues and Visual Salient Regions for Dynamic Emotion Recognition

Cannot Refute

[29] Online Learning Platform of Modern Chinese Course Based on Multimodal Emotion-Aware Adaptive Learning

Cannot Refute

[30] Behavioral and physiological signals-based deep multimodal approach for mobile emotion recognition

Cannot Refute

Contribution

TMPO: Theory-of-Mind preference optimization method

[33] Grounded Reinforcement Learning for Visual Reasoning

Cannot Refute

[34] Learning only with images: Visual reinforcement learning with reasoning, rendering, and visual feedback

Cannot Refute

[35] Multiagent inverse reinforcement learning via theory of mind reasoning

Cannot Refute

[36] SuperRL: Reinforcement Learning with Supervision to Boost Language Model Reasoning

Cannot Refute

[37] Reflexion: Language agents with verbal reinforcement learning

Cannot Refute

[38] Learning to reason without external rewards

Cannot Refute

[39] Mental modeling of reinforcement learning agents by language models

Cannot Refute

[40] Visual Reinforcement Learning With Self-Supervised 3D Representations

Cannot Refute

[41] Improving model-based reinforcement learning with internal state representations through self-supervision

Cannot Refute

[42] Relational deep reinforcement learning

Cannot Refute
