Unveiling the Cognitive Compass: Theory-of-Mind–Guided Multimodal Emotion Reasoning

ICLR 2026 Conference Submission
Anonymous Authors
Multimodal Affective Computing, Multimodal Understanding and Reasoning, Reinforcement Learning
Abstract:

Despite rapid progress in multimodal large language models (MLLMs), their capability for deep emotional understanding remains limited. We argue that genuine affective intelligence requires explicit modeling of Theory of Mind (ToM), the cognitive substrate from which emotions arise. First, we introduce HitEmotion, a ToM-grounded hierarchical benchmark that diagnoses capability breakpoints across increasing levels of cognitive depth. Second, we propose a ToM-guided reasoning chain that tracks mental states and calibrates cross-modal evidence to achieve faithful emotional reasoning. Third, we introduce TMPO, a reinforcement learning method that uses intermediate mental states as process-level supervision to guide and strengthen model reasoning. Extensive experiments show that HitEmotion exposes deep emotional reasoning deficits in state-of-the-art models, especially on cognitively demanding tasks. In evaluation, the ToM-guided reasoning chain and TMPO improve end-task accuracy and yield more faithful, more coherent rationales. Together, these contributions give the research community a practical toolkit for evaluating and enhancing the cognition-based emotional understanding capabilities of MLLMs.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces HitEmotion, a hierarchical benchmark for diagnosing emotional reasoning capabilities at increasing cognitive depths, alongside a ToM-guided reasoning chain and TMPO reinforcement learning method. It resides in the 'Cognitive Architecture for ToM-Enhanced Emotion Processing' leaf, which contains only three papers total, including this one. This leaf sits within the broader 'Theory-of-Mind Reasoning Frameworks and Architectures' branch, indicating the work addresses architectural design rather than pure benchmarking or application deployment. The sparse population of this specific leaf suggests the integration of ToM principles into cognitive architectures for emotion processing remains an emerging research direction.

The taxonomy reveals neighboring leaves focused on Bayesian probabilistic reasoning and interpretability analysis, while sibling branches address benchmark development and application domains like strategic games and embodied agents. The paper's positioning bridges multiple concerns: it contributes both a benchmark (typically housed in the evaluation branch) and architectural innovations (reasoning chains, TMPO training). This cross-cutting nature distinguishes it from purely benchmark-focused efforts like MMToM-QA or purely application-driven work in negotiation scenarios. The scope notes clarify that this leaf excludes application-specific implementations and pure evaluation studies, positioning the work as foundational framework development with accompanying diagnostic tools.

Among the 22 candidates examined through semantic search, none clearly refutes any of the three contributions. For the HitEmotion benchmark, 2 candidates were examined with no refutations, suggesting limited prior work on hierarchical ToM-grounded emotion evaluation. For the ToM-guided reasoning chain and the TMPO method, 10 candidates each were examined with no refutations, indicating that these specific technical approaches appear novel within the search scope. However, this reflects a limited top-K semantic search rather than exhaustive coverage, so the absence of refutations is evidence of novelty within the examined sample, not proof of field-wide originality.
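
To make the top-K caveat concrete, the following minimal sketch ranks candidate papers by cosine similarity to a contribution embedding and keeps only the top K. The vectors, paper identifiers, and the function name `top_k_candidates` are hypothetical; this report does not describe the actual retrieval pipeline or embedding model.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def top_k_candidates(query, corpus, k):
    """Return the k (title, score) pairs most similar to the query vector."""
    scored = [(title, cosine(query, vec)) for title, vec in corpus.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Toy 3-dimensional embeddings (illustrative values only, not real data).
corpus = {
    "paper_a": [1.0, 0.0, 0.0],
    "paper_b": [0.8, 0.6, 0.0],
    "paper_c": [0.0, 1.0, 0.0],
    "paper_d": [0.0, 0.0, 1.0],
}
query = [1.0, 0.1, 0.0]  # hypothetical embedding of one claimed contribution
shortlist = top_k_candidates(query, corpus, k=2)
```

Anything ranked below position K, such as `paper_c` here, is never compared against the contribution, which is why zero refutations within the shortlist is weaker evidence than zero refutations across the field.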

Given the sparse taxonomy leaf and absence of refutations among 22 examined candidates, the work appears to occupy relatively unexplored territory at the intersection of ToM cognitive architectures and emotion reasoning. The hierarchical benchmark structure and process-level supervision via mental states represent distinctive technical choices. However, the limited search scope means potentially relevant work in adjacent areas—such as emotion explanation systems or multi-agent ToM benchmarks—may not have been fully captured, and broader literature on reinforcement learning from intermediate reasoning steps could provide additional context for assessing the TMPO contribution's novelty.

Taxonomy

Core-task Taxonomy Papers: 20
Claimed Contributions: 3
Contribution Candidate Papers Compared: 22
Refutable Papers: 0

Research Landscape Overview

Core task: Theory-of-Mind guided multimodal emotion reasoning. This emerging field seeks to integrate cognitive models of mental state attribution with multimodal signal processing to enable richer emotion understanding. The taxonomy reveals four main branches: reasoning frameworks and architectures that design computational mechanisms for ToM-enhanced emotion processing; multimodal benchmarks and evaluation datasets that provide standardized testbeds for assessing ToM capabilities across vision, language, and other modalities; application domains spanning negotiation, human-computer interaction, and clinical settings; and cognitive and clinical perspectives that ground computational work in psychological theory. Representative efforts include MMToM-QA[7] and Muma ToM[2] for benchmark development, Multimind Werewolf[3] for strategic reasoning in social games, and Emotional Theory Mind[4] for affective state inference. These branches collectively illustrate how ToM reasoning can be operationalized through diverse modalities and task contexts.

Recent work highlights contrasts between symbolic cognitive architectures and data-driven learning approaches, as well as trade-offs between interpretability and scalability. Some studies emphasize Bayesian or probabilistic frameworks for belief modeling, such as Weak to Strong Bayesian[1] and Bayesian Planner ToM[14], while others leverage neural architectures for end-to-end multimodal fusion.

Cognitive Compass[0] sits within the cognitive architecture branch, proposing a structured framework for ToM-enhanced emotion processing that integrates multiple reasoning modules. Compared to nearby works like Consistency Uncertainty Detection[11], which focuses on uncertainty estimation in mental state inference, and Modeling ToM HCI[12], which targets interactive system design, Cognitive Compass[0] emphasizes a holistic cognitive architecture that coordinates perception, reasoning, and affective interpretation. This positioning reflects ongoing debates about whether ToM-guided emotion reasoning is best achieved through modular symbolic systems or through tightly integrated neural models.

Claimed Contributions

HitEmotion: ToM-grounded hierarchical benchmark for multimodal emotion understanding

The authors present HitEmotion, a benchmark that systematically organizes 24 emotion-related tasks into three hierarchical levels (Emotion Perception and Recognition, Emotion Understanding and Analysis, and Emotion Cognition and Reasoning) grounded in Theory of Mind principles. This structure enables precise measurement of model capability breakpoints at different cognitive depths.
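
One way to read "capability breakpoint" is the first hierarchy level at which aggregate accuracy falls below a chosen threshold. The sketch below assumes that reading. The level names come from the description above, while the task names, scores, threshold, and the function `find_breakpoint` are hypothetical and not taken from the paper.

```python
from statistics import mean

# Level names as listed in the contribution description, in hierarchy order.
LEVELS = [
    "Emotion Perception and Recognition",
    "Emotion Understanding and Analysis",
    "Emotion Cognition and Reasoning",
]

def level_accuracy(task_scores):
    """Average per-task accuracy within each level, in hierarchy order."""
    return {level: mean(task_scores[level].values()) for level in LEVELS}

def find_breakpoint(task_scores, threshold=0.5):
    """Return the first level whose mean accuracy is below threshold, else None."""
    for level, acc in level_accuracy(task_scores).items():
        if acc < threshold:
            return level
    return None

# Made-up scores for illustration only; the benchmark has 24 tasks in total.
scores = {
    LEVELS[0]: {"task_1": 0.82, "task_2": 0.78},
    LEVELS[1]: {"task_3": 0.64, "task_4": 0.56},
    LEVELS[2]: {"task_5": 0.41, "task_6": 0.35},
}
breakpoint_level = find_breakpoint(scores)
```

With these toy numbers the model clears the first two levels and breaks at the third. A real analysis would also need to decide how to weight tasks of different sizes and metrics, which this sketch ignores.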

2 retrieved papers

ToM-guided reasoning chain for faithful emotional reasoning

The authors develop a structured reasoning approach based on Theory of Mind that explicitly tracks mental states and integrates multimodal evidence. This method aims to shift models from superficial pattern matching to deeper mental state simulation for more faithful emotional understanding.

10 retrieved papers

TMPO: Theory-of-Mind preference optimization method

The authors propose TMPO, a novel reinforcement learning framework that leverages intermediate mental states from ToM-based reasoning chains as process-level supervision. This method combines supervised fine-tuning with group-wise reward policy optimization to transform reasoning from a general emergent ability into a domain-acquired skill.
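
The description above does not give TMPO's reward or objective, so the following is only a sketch of the general idea: score each sampled response in a group with an outcome reward plus a process reward for matching intermediate mental states, then normalize rewards within the group to obtain advantages. The mean and standard-deviation normalization, the weight `alpha`, the exact-match scoring, and all function and field names are assumptions, not the authors' formulation.

```python
from statistics import mean, pstdev

def process_reward(predicted_states, reference_states):
    """Fraction of reference mental-state steps the response reproduces."""
    if not reference_states:
        return 0.0
    hits = sum(p == r for p, r in zip(predicted_states, reference_states))
    return hits / len(reference_states)

def combined_reward(sample, reference, alpha=0.5):
    """Outcome reward (final answer correct) plus weighted process reward."""
    outcome = 1.0 if sample["answer"] == reference["answer"] else 0.0
    return outcome + alpha * process_reward(sample["states"], reference["states"])

def group_advantages(rewards, eps=1e-8):
    """Center and scale rewards within one group of sampled responses."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical reference annotation and three sampled responses to one prompt.
reference = {"answer": "disappointed",
             "states": ["expects praise", "hears criticism"]}
group = [
    {"answer": "disappointed", "states": ["expects praise", "hears criticism"]},
    {"answer": "disappointed", "states": ["feels tired", "hears criticism"]},
    {"answer": "happy", "states": ["feels tired", "wants lunch"]},
]
rewards = [combined_reward(s, reference) for s in group]
advantages = group_advantages(rewards)
```

The second response reaches the correct label with a partly wrong mental-state trace and therefore receives a smaller advantage than the first, which is the kind of signal process-level supervision is meant to provide. In an actual training loop these advantages would weight a policy-gradient update, a step omitted here.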

10 retrieved papers
