EmotionThinker: Prosody-Aware Reinforcement Learning for Explainable Speech Emotion Reasoning

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: Speech Emotion Recognition, Speech LLMs, Speech Processing, Reinforcement Learning
Abstract:

Emotional information in speech plays a unique role in multimodal perception. However, current Speech Large Language Models (SpeechLLMs), like conventional speech emotion recognition (SER) systems, still treat emotion understanding as a simple classification problem. This provides limited interpretability of predictions while leaving the LLMs' expressive and reasoning capabilities underutilized. In this work, we take the first step toward reformulating SER as a deep reasoning problem through reinforcement learning (RL). We propose EmotionThinker, which is designed to generate accurate emotion predictions with interpretable explanations grounded in fine-grained acoustic cues. To achieve this, we first construct EmotionCoT-35K, an emotional reasoning dataset with Chain-of-Thought annotations and detailed captions. Second, we observe that current SpeechLLMs exhibit weak prosody perception, even though prosodic cues are fundamental signals for interpreting emotions. To address this, we develop the prosody-enhanced foundation model EmotionThinker-Base and demonstrate that prosody enhancement improves emotion understanding. Third, we introduce Group-Relative Policy Optimization with Progressive Trust-aware Reasoning Reward (GRPO-PTR) for RL. Unlike standard GRPO, which relies only on rule-based outcome rewards, GRPO-PTR progressively introduces a reasoning reward, dynamically adjusts it with a trustworthiness weight reflecting the alignment between reasoning and outcome, and evaluates overall reasoning quality with a reward model based on multi-dimensional criteria. EmotionThinker outperforms previous state-of-the-art models in both emotion accuracy and explanation quality, advancing SER toward interpretable multimodal reasoning.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces EmotionThinker, which reformulates speech emotion recognition as a reasoning problem using reinforcement learning to generate interpretable explanations grounded in acoustic cues. It resides in the 'Reinforcement Learning for Reasoning' leaf under 'Reasoning-Based Approaches', alongside only two sibling papers (EMO-RL and R1-omni). This leaf represents a sparse, emerging research direction within the broader taxonomy of 50 papers across 36 topics, indicating that RL-driven reasoning for emotion understanding remains relatively unexplored compared to attention-based explainability or multimodal fusion approaches.

The taxonomy reveals that neighboring leaves focus on Chain-of-Thought generative reasoning (without RL) and Multimodal Reasoning Frameworks, while the parent branch 'Reasoning-Based Approaches' contrasts with 'Explainability Techniques' that emphasize post-hoc analysis tools like SHAP and LIME. EmotionThinker bridges reasoning generation and explainability by training models to articulate logic during inference, diverging from purely attention-driven methods in adjacent branches. The sparse population of its leaf suggests this RL-for-reasoning direction is less crowded than feature-level explainability or transformer-based multimodal fusion, which contain four to six papers each.

Among the 20 candidates examined across the three contributions, none were found to clearly refute the paper's claims. For the EmotionCoT-35K dataset, 3 candidates were examined with 0 refutable; for the prosody-enhanced foundation model, 10 candidates with 0 refutable; and for the GRPO-PTR framework, 7 candidates with 0 refutable. This limited search scope (top-K semantic matches plus citation expansion) suggests that, within the examined literature, no prior work directly overlaps with the combination of prosody enhancement, CoT annotations, and RL-based reasoning rewards. However, the small candidate pool means the analysis cannot confirm exhaustive novelty across the entire field.

Based on the 20-candidate search, the work appears to occupy a relatively novel position at the intersection of RL-driven reasoning and prosody-aware emotion understanding. The sparse taxonomy leaf and absence of refutable candidates within the examined scope suggest incremental but meaningful differentiation from existing methods. A broader literature search or deeper examination of the two sibling papers' technical details would be needed to assess whether the prosody enhancement and trust-aware reward mechanisms constitute substantial advances over prior RL-based reasoning frameworks.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 20
Refutable Papers: 0

Research Landscape Overview

Core task: explainable speech emotion recognition with reasoning. The field organizes around several complementary branches that address different facets of understanding and justifying emotion predictions from speech. Explainability Techniques and Interpretability Methods focus on making model decisions transparent through attention mechanisms, saliency maps, and post-hoc analysis tools such as LIME and SHAP, as seen in works like XAI Speech Techniques[11] and Distribution-Shift LIME[44]. Reasoning-Based Approaches emphasize structured inference and logical chains, often leveraging reinforcement learning or chain-of-thought prompting to produce human-understandable rationales. Multimodal Integration combines acoustic signals with text, video, or physiological data to enrich emotion understanding, while Speech-Only Emotion Recognition concentrates on purely acoustic feature extraction and modeling. Conversational and Contextual Emotion Understanding examines dialogue history and speaker interactions, and Cross-Lingual Emotion Detection Tasks explore generalization across languages. Finally, Affective Computing Foundations and Applications ground the technical work in real-world scenarios such as mental health monitoring and human-computer interaction.

Within Reasoning-Based Approaches, a small but growing cluster explores reinforcement learning to guide models toward interpretable decision paths. EmotionThinker[0] exemplifies this direction by training agents to generate step-by-step reasoning traces that justify emotion labels, aligning closely with EMO-RL[37] and R1-omni[6], which similarly apply RL frameworks to refine reasoning quality. These methods contrast with purely attention-driven explainability (e.g., Multimodal Explainable Emotion[1]) by explicitly optimizing for coherent rationales rather than relying on post-hoc visualization.

A key trade-off is computational cost versus interpretability depth: RL-based reasoning can yield richer explanations but requires careful reward design and longer training. EmotionThinker[0] sits at the intersection of reasoning and explainability, offering a middle ground where the model learns to articulate its logic during inference, bridging the gap between black-box performance and human-centered transparency.

Claimed Contributions

EmotionCoT-35K dataset with prosody-aware Chain-of-Thought annotations

The authors curate a training dataset of 35,000 speech–reasoning pairs spanning about 200 hours of audio, annotated with emotion labels and fine-grained prosodic features (pitch, energy, speed, stress, intonation) plus step-wise reasoning traces. This dataset enables models to produce both emotion labels and perceptually grounded explanations.
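To make the dataset's described structure concrete, here is a minimal sketch of what a single EmotionCoT-35K record could look like, assuming one speech clip paired with an emotion label, the five prosodic attributes named above, and a step-wise reasoning trace. All field names are hypothetical illustrations, not the paper's released schema.

```python
from dataclasses import dataclass, field

# Hypothetical schema for one EmotionCoT-35K record; field names are
# illustrative and not taken from the paper's actual release.
@dataclass
class EmotionCoTRecord:
    audio_path: str                                       # pointer to the speech clip
    emotion_label: str                                    # e.g. "angry", "sad", "happy"
    prosody: dict = field(default_factory=dict)           # fine-grained prosodic features
    reasoning_steps: list = field(default_factory=list)   # step-wise CoT trace

    def is_complete(self) -> bool:
        """A record is usable for training only if it carries all five
        prosodic attributes and at least one reasoning step."""
        required = {"pitch", "energy", "speed", "stress", "intonation"}
        return required.issubset(self.prosody) and len(self.reasoning_steps) > 0

rec = EmotionCoTRecord(
    audio_path="clip_00042.wav",
    emotion_label="angry",
    prosody={"pitch": "high, rising", "energy": "loud", "speed": "fast",
             "stress": "heavy on content words", "intonation": "sharp falls"},
    reasoning_steps=["Pitch is high and rising.", "Energy is elevated.",
                     "These cues jointly indicate anger."],
)
print(rec.is_complete())  # True
```

A completeness check like this is one plausible way such a corpus could be validated so that every training pair supports both the label prediction and the perceptually grounded explanation.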

3 retrieved papers
Prosody-enhanced foundation model EmotionThinker-Base

The authors build a foundation model via prosody-centric supervised fine-tuning on approximately 500 hours of data, including stress perception, prosodic attribute classification, and comparative prosodic augmentation tasks. This stage equips the model with strong prosody perception ability before reinforcement learning.
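The comparative prosodic augmentation task mentioned above can be illustrated with a small sketch: given measured prosodic attributes for two renditions of an utterance, emit a comparison question and its ground-truth answer. The attribute names, wording, and function below are assumptions for illustration, not the paper's actual task format.

```python
# Illustrative sketch of a "comparative prosodic augmentation" training item:
# compare a measured prosodic attribute across two clips and derive the
# ground-truth answer automatically. Names and phrasing are hypothetical.
def make_comparison_item(feats_a: dict, feats_b: dict, attribute: str) -> dict:
    if attribute not in feats_a or attribute not in feats_b:
        raise KeyError(f"both clips need a measured {attribute!r}")
    question = f"Between clip A and clip B, which has the higher {attribute}?"
    if feats_a[attribute] > feats_b[attribute]:
        answer = "A"
    elif feats_a[attribute] < feats_b[attribute]:
        answer = "B"
    else:
        answer = "equal"
    return {"question": question, "answer": answer}

item = make_comparison_item({"pitch_hz": 220.0, "energy_db": -18.0},
                            {"pitch_hz": 180.0, "energy_db": -12.0},
                            "pitch_hz")
print(item["answer"])  # A
```

Deriving labels directly from signal measurements like this is one way such comparative tasks can be generated at scale without manual annotation, which would help explain how a 500-hour prosody-centric SFT corpus is feasible.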

10 retrieved papers
GRPO-PTR reinforcement learning framework with progressive trust-aware reasoning reward

The authors propose a reinforcement learning strategy that progressively introduces a reasoning reward model trained on multi-dimensional criteria (factual alignment, interpretative quality, caption completeness, fluency) and dynamically adjusts it with a trustworthiness weight reflecting reasoning-outcome alignment. This approach supervises intermediate reasoning quality and mitigates reward hacking.
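The reward composition described above can be sketched as follows: a rule-based outcome reward, a multi-dimensional reasoning reward from the reward model, a progressive schedule that phases the reasoning reward in, and a trustworthiness weight tied to reasoning-outcome alignment. The functional forms here (linear warm-up, uniform averaging, additive combination) are assumptions for illustration, not the paper's actual equations.

```python
# Hedged sketch of a GRPO-PTR-style total reward, under assumed functional
# forms: the paper's exact schedule, weighting, and combination may differ.

def progressive_weight(step: int, warmup_steps: int = 1000) -> float:
    """Phase the reasoning reward in linearly over a warm-up window."""
    return min(1.0, step / warmup_steps)

def reasoning_reward(scores: dict) -> float:
    """Average of the reward model's per-criterion scores in [0, 1], over the
    four criteria named in the contribution description."""
    dims = ("factual_alignment", "interpretative_quality",
            "caption_completeness", "fluency")
    return sum(scores[d] for d in dims) / len(dims)

def total_reward(outcome_correct: bool, rm_scores: dict,
                 trust: float, step: int) -> float:
    """trust in [0, 1] reflects how well the reasoning supports the outcome;
    a low trust value suppresses the reasoning reward, which is one way an
    alignment weight can discourage reward hacking."""
    r_outcome = 1.0 if outcome_correct else 0.0
    r_reason = reasoning_reward(rm_scores)
    return r_outcome + progressive_weight(step) * trust * r_reason

scores = {"factual_alignment": 0.9, "interpretative_quality": 0.8,
          "caption_completeness": 0.7, "fluency": 1.0}
print(total_reward(True, scores, trust=0.5, step=500))
```

Gating the reasoning term by both the schedule and the trust weight means a policy cannot inflate its return early in training, or with explanations that contradict its own prediction, by producing fluent but unfaithful rationales.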

7 retrieved papers

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

EmotionCoT-35K dataset with prosody-aware Chain-of-Thought annotations

Contribution

Prosody-enhanced foundation model EmotionThinker-Base

Contribution

GRPO-PTR reinforcement learning framework with progressive trust-aware reasoning reward