Risk-Sensitive Reinforcement Learning for Alleviating Exploration Dilemmas in Large Language Models

ICLR 2026 Conference Submission · Anonymous Authors
Keywords: RLVR, Large Language Model, Risk-Sensitive Reinforcement Learning
Abstract:

Reinforcement Learning with Verifiable Rewards (RLVR) has proven effective for enhancing Large Language Models (LLMs) on complex reasoning tasks. Yet current methods face an exploration dilemma: standard RL struggles to escape the local optima of pre-trained LLMs’ sharply peaked initial policies, boosting single-solution accuracy (pass@1) but suppressing solution diversity and multi-solution performance (pass@k). As a result, RLVR often distills existing capabilities rather than discovering new reasoning strategies. We address this with a Risk-Sensitive Reinforcement Learning framework. By adopting a risk-seeking objective that interpolates between mean and maximum rewards, we derive a novel Risk-Sensitive GRPO (RS-GRPO) algorithm that emphasizes hard prompts to drive exploration. Across six mathematical reasoning benchmarks and five LLMs, RS-GRPO consistently improves pass@k performance while enhancing or maintaining pass@1.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes a risk-sensitive reinforcement learning framework to address the exploration-exploitation dilemma in RLVR for LLMs, introducing the RS-GRPO algorithm that interpolates between mean and maximum rewards to enhance solution diversity. It resides in the 'Risk-Sensitive and Objective-Based Exploration' leaf, which contains four papers total, indicating a moderately sparse research direction within the broader 'Exploration Strategy Design and Mechanisms' branch. This leaf focuses specifically on modifying RL objectives to drive exploration, distinguishing it from intrinsic motivation approaches that rely on curiosity signals or uncertainty estimates.

The taxonomy reveals neighboring leaves addressing exploration through intrinsic rewards (four papers on curiosity-driven methods) and adaptive control mechanisms (three papers on dynamic exploration adjustment). The paper's risk-seeking objective diverges from these by explicitly balancing mean and maximum reward rather than relying on novelty bonuses or adaptive schedules. The broader 'Knowledge-Guided and Semantic Exploration' branch (nine papers across three leaves) represents an alternative paradigm using external knowledge or LLM-generated subgoals, while the paper's approach remains within objective-based exploration without external scaffolding. The scope note clarifies that standard policy gradient methods without exploration-specific modifications belong elsewhere, positioning this work as a deliberate departure from conventional RLVR training.

Among nineteen candidates examined across three contributions, none clearly refuted the work. Eight candidates were examined for the 'Risk-Sensitive Framework' contribution and nine for the 'RS-GRPO Algorithm', with no refutable matches in either case; the two candidates for the 'Theoretical Analysis' contribution likewise showed no overlapping prior work. This limited search scope—focused on top-K semantic matches and citation expansion—suggests that the specific combination of risk-sensitive objectives and GRPO adaptation for LLM exploration is novel within the examined literature. However, the analysis does not claim exhaustive coverage of all possible prior work in risk-sensitive RL or exploration methods.

The work appears to occupy a distinct position within a moderately explored research direction, combining risk-aware objective design with practical algorithmic instantiation for LLM reasoning tasks. The absence of refuting candidates among nineteen examined suggests novelty in the specific technical approach, though the limited search scope means broader connections to risk-sensitive RL literature outside the LLM-exploration context may exist. The taxonomy structure indicates this direction is less crowded than intrinsic motivation or knowledge-guided approaches, potentially offering room for methodological contributions.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 19
Refutable Papers: 0

Research Landscape Overview

Core task: Enhancing exploration in reinforcement learning for large language models. The field organizes itself around several complementary perspectives. Exploration Strategy Design and Mechanisms encompasses foundational techniques such as curiosity-driven methods (Curiosity-Driven Exploration LLM[4]) and risk-sensitive objectives (Risk-Sensitive RL LLM[0]), while Knowledge-Guided and Semantic Exploration leverages linguistic structure and prior knowledge to direct search (Language Guided Exploration[44]). Memory and Experience-Based Exploration focuses on replay mechanisms (Retrospective Replay[10], RLEP Experience Replay[31]) that reuse past trajectories, and Hierarchical and Structured Exploration addresses multi-level decision-making (Algorithm of Thoughts[15], RL of Thoughts[34]). Training Efficiency and Sample Optimization targets reducing computational overhead (Enhancing Efficiency Exploration RL[3], Efficient Exploration LLMs[6]), while Theoretical Analysis and Empirical Evaluation provides rigorous benchmarks (Survey RL Complex Environments[23]). Application-Specific Exploration tailors methods to domains like tool use (Retool Strategic Tool Use[8]) or recommendation (Optimizing Novelty Recommendations[30]), and Surveys and Integrative Frameworks synthesize cross-cutting insights.

Within Exploration Strategy Design, a central tension emerges between intrinsic motivation approaches that reward novelty (Intrinsic Motivation Exploration[36], Intrinsic Exploration LLM[48]) and objective-based methods that explicitly balance risk and reward. Risk-Sensitive RL LLM[0] sits in this latter cluster, emphasizing controlled exploration under uncertainty—contrasting with purely curiosity-driven schemes like Curiosity-Driven Exploration LLM[4] that prioritize information gain without explicit risk constraints.
Nearby works such as Outcome-based Exploration[18] and Unlocking Reasoning Capabilities[27] also shape exploration via task-specific objectives, yet Risk-Sensitive RL LLM[0] distinguishes itself by incorporating risk-awareness into the exploration policy. This positions it as a bridge between classical RL safety concerns and modern LLM fine-tuning, addressing how agents can explore efficiently while respecting distributional or worst-case performance criteria.

Claimed Contributions

Risk-Sensitive Reinforcement Learning Framework for LLMs

The authors propose a risk-sensitive RL framework that uses an exponential utility function to create a risk-seeking objective. This objective interpolates between optimizing mean reward and maximum reward, enabling policies to escape local optima induced by sharply peaked pretrained LLM distributions and discover more diverse reasoning strategies.
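The interpolation this contribution describes can be checked numerically. The sketch below uses the standard exponential-utility form J_β = (1/β) log E[exp(βR)] as an assumption; the paper's exact parameterization may differ. As β → 0 the objective approaches the mean reward, and as β → ∞ it approaches the maximum, so β trades off exploitation of average performance against chasing the best achievable reward.

```python
import math

def risk_sensitive_value(rewards, probs, beta):
    """Exponential-utility objective J_beta = (1/beta) * log E[exp(beta * R)].

    beta = 0 recovers the mean reward; large beta approaches the maximum,
    so beta interpolates between the two regimes.
    """
    if beta == 0.0:
        return sum(p * r for p, r in zip(probs, rewards))
    # Log-sum-exp with max-shift for numerical stability.
    m = max(beta * r for r in rewards)
    z = sum(p * math.exp(beta * r - m) for p, r in zip(probs, rewards))
    return (m + math.log(z)) / beta

rewards = [0.0, 0.5, 1.0]    # verifiable rewards of three sampled solutions
probs   = [0.8, 0.15, 0.05]  # a sharply peaked policy over those solutions

mean_r   = risk_sensitive_value(rewards, probs, 0.0)   # mean reward, 0.125
near_max = risk_sensitive_value(rewards, probs, 50.0)  # close to max reward 1.0
```

Intermediate β values land strictly between the two extremes, which is the knob the framework exposes for tuning exploration pressure.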

8 retrieved papers
RS-GRPO Algorithm

The authors instantiate their risk-sensitive framework as RS-GRPO, a simple algorithm requiring only minor code modifications to existing GRPO implementations. It uses a risk-sensitive advantage function that dynamically re-weights optimization to emphasize hard prompts where the model performs poorly.
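One way to see how such a re-weighting can emphasize hard prompts is the hypothetical sketch below. It is not the paper's formula: it derives group advantages from a sample estimate of the exponential-utility gradient, where each response's weight is a softmax over group rewards, centered so advantages sum to zero (as β → 0 this collapses toward a mean-centered baseline like GRPO's).

```python
import math

def rs_advantages(rewards, beta):
    """Hypothetical risk-sensitive group advantages (illustrative only).

    Each response is weighted by softmax(beta * reward) over the group,
    then centered so the advantages sum to zero.  Rare successes in a
    mostly-failing group receive outsized positive advantage.
    """
    g = len(rewards)
    m = max(beta * r for r in rewards)
    w = [math.exp(beta * r - m) for r in rewards]
    z = sum(w)
    return [g * wi / z - 1.0 for wi in w]

# Hard prompt: 1 correct answer out of 8 -> the rare success is amplified.
hard = rs_advantages([1, 0, 0, 0, 0, 0, 0, 0], beta=2.0)
# Easy prompt: 7 correct out of 8 -> routine successes get small advantage.
easy = rs_advantages([1, 1, 1, 1, 1, 1, 1, 0], beta=2.0)
```

Under this sketch the lone correct answer on the hard prompt receives a far larger advantage than any success on the easy prompt, which is the qualitative behavior the contribution claims: optimization pressure concentrates where the model currently performs poorly.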

9 retrieved papers
Theoretical and Empirical Analysis of Exploration Dilemma

The authors provide both theoretical proofs (showing standard policy gradient can decrease optimal action probability while risk-sensitive gradient guarantees improvement) and empirical demonstrations (bandit experiments) that standard RL fails to escape local optima from sharply peaked initial policies, while their risk-sensitive approach succeeds.
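The bandit intuition can be illustrated with exact gradients in a two-armed toy (a sketch, not the paper's experiment or proof, which concerns sampled policy gradients): when the initial policy is sharply peaked on the worse arm, the mean-reward gradient toward the optimal arm is tiny, while the risk-seeking gradient on the same arm is much larger.

```python
import math

def softmax(logits):
    m = max(logits)
    e = [math.exp(x - m) for x in logits]
    z = sum(e)
    return [x / z for x in e]

def grad_mean(logits, rewards):
    """Exact gradient of the mean objective J = sum_a pi_a r_a w.r.t. the
    softmax logits: dJ/dtheta_a = pi_a * (r_a - J)."""
    p = softmax(logits)
    j = sum(pa * ra for pa, ra in zip(p, rewards))
    return [pa * (ra - j) for pa, ra in zip(p, rewards)]

def grad_rs(logits, rewards, beta):
    """Exact gradient of J_beta = (1/beta) log sum_a pi_a exp(beta r_a):
    dJ/dtheta_a = (w_a - pi_a) / beta, where w is the policy tilted by
    exp(beta * reward)."""
    p = softmax(logits)
    t = [pa * math.exp(beta * ra) for pa, ra in zip(p, rewards)]
    z = sum(t)
    w = [x / z for x in t]
    return [(wa - pa) / beta for wa, pa in zip(w, p)]

# Two-armed bandit with a policy sharply peaked on the worse arm 0.
logits = [math.log(0.99), math.log(0.01)]
rewards = [0.5, 1.0]

g_mean = grad_mean(logits, rewards)          # tiny push toward optimal arm 1
g_rs = grad_rs(logits, rewards, beta=10.0)   # much larger push toward arm 1
```

Here the risk-seeking gradient on the underexplored optimal arm is over an order of magnitude larger than the mean-reward gradient, illustrating why the risk-sensitive objective can escape a sharply peaked initialization that starves standard RL of learning signal.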

2 retrieved papers

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution: Risk-Sensitive Reinforcement Learning Framework for LLMs

Contribution: RS-GRPO Algorithm

Contribution: Theoretical and Empirical Analysis of Exploration Dilemma
