SHAPO: Sharpness-Aware Policy Optimization for Safe Exploration

ICLR 2026 Conference Submission, Anonymous Authors
Keywords: Safe Exploration, Sharpness-Aware Minimization, Epistemic Uncertainty
Abstract:

Safe exploration is a prerequisite for deploying reinforcement learning (RL) agents in safety-critical domains. In this paper, we approach safe exploration through the lens of epistemic uncertainty, where the actor's sensitivity to parameter perturbations serves as a practical proxy for regions of high uncertainty. We propose Sharpness-Aware Policy Optimization (SHAPO), a sharpness-aware policy update rule that evaluates gradients at perturbed parameters, making policy updates pessimistic with respect to the actor's epistemic uncertainty. Analytically, we show that this adjustment implicitly reweights policy gradients, amplifying the influence of rare unsafe actions while tempering contributions from already safe ones, thereby biasing learning toward conservative behavior in under-explored regions. Across several continuous-control tasks, our method consistently improves both safety and task performance over existing baselines, significantly expanding their Pareto frontiers.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes Sharpness-Aware Policy Optimization (SHAPO), which uses parameter perturbations as a proxy for epistemic uncertainty to guide safe exploration. It resides in the 'Policy Optimization with Safety Constraints' leaf, which contains four papers including the original work. This leaf sits within the broader 'Algorithm Design for Safe Exploration' branch, indicating a moderately populated research direction focused on integrating safety directly into policy gradient updates rather than using post-hoc corrections or model-based planning.

The taxonomy reveals neighboring approaches that handle safety through different mechanisms. The 'Safety Layer and Action Correction Methods' leaf (three papers) applies analytical filters after policy decisions, while 'Model-Based Safe RL' (two papers) leverages dynamics models for predictive safety. The 'Uncertainty Quantification and Epistemic Safety' branch addresses similar concerns about unknown risks but through conservative critics and probabilistic constraints rather than sharpness-aware updates. SHAPO's epistemic uncertainty framing connects conceptually to this branch, though it operationalizes uncertainty differently via parameter sensitivity rather than explicit distributional modeling.

Among thirteen candidates examined across three contributions, none were identified as clearly refuting the work. The core SHAPO method examined five candidates with zero refutations, while the analytical gradient reweighting characterization examined eight candidates, also with zero refutations. The reinterpretation of Fisher-SAM as epistemic pessimism examined no candidates. This limited search scope—thirteen papers from semantic retrieval—suggests the analysis captures closely related policy optimization methods but may not cover the full breadth of uncertainty-driven safe exploration approaches or sharpness-aware techniques from adjacent fields.

Given the search scale, the work appears to occupy a relatively distinct position within policy optimization methods, combining sharpness awareness with safety constraints in a novel way. However, the analysis does not exhaustively cover connections to broader sharpness-aware learning literature or alternative epistemic uncertainty quantification methods outside the top-thirteen semantic matches. The taxonomy structure suggests this is an active but not overcrowded research direction, with room for differentiation among the four sibling papers in the same leaf.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 13
Refutable Papers: 0

Research Landscape Overview

Core task: safe exploration in reinforcement learning under constraints. The field addresses how agents can learn effective policies while respecting safety requirements throughout the learning process. The taxonomy reveals a rich structure organized around several complementary perspectives. Constraint Formulation and Problem Frameworks establish the mathematical foundations, defining how safety requirements are encoded, ranging from hard constraints in Safety Constrained MDPs[13] to probabilistic formulations like Hierarchical Chance Constraints[4].

Algorithm Design for Safe Exploration encompasses policy optimization methods that directly integrate safety into learning, while Uncertainty Quantification and Epistemic Safety focuses on managing unknown risks through conservative estimation, as seen in Conservative Safety Critics[8] and Epistemic Uncertainty[46]. Geometric perspectives emerge in approaches like Constraint Manifold[7] and Robot Constraint Manifold[16], which exploit the structure of feasible state spaces.

The taxonomy also distinguishes between Statewise and Instantaneous Safety (ensuring safety at every step versus over trajectories), includes Multi-Agent and Hierarchical settings, and recognizes Interactive and Human-in-the-Loop methods such as Natural Language Constraints[6]. Application Domains and Exploration Strategy Design round out the landscape, connecting theory to practice.

Several active research directions reveal key trade-offs between exploration efficiency and safety guarantees. Works emphasizing provable safety, such as Provable Guarantees[17] and Zero Constraint Violation[38], contrast with methods that balance constraint satisfaction with learning speed, such as Recovery RL[3], which allows temporary violations with recovery mechanisms.
SHAPO[0] sits within the Policy Optimization with Safety Constraints branch alongside Constrained PPO[11] and Feasible Actor Critic[30], sharing their focus on integrating constraint handling directly into policy gradient methods. Compared to Multi Objective Safety[14], which frames safety as one objective among many, SHAPO emphasizes constraint satisfaction as a hard requirement rather than a preference to be traded off. The positioning reflects a broader tension in the field: whether to pursue conservative approaches that guarantee safety from the outset or to enable more aggressive exploration with corrective mechanisms, a question that remains central as methods scale to complex real-world domains.

Claimed Contributions

Sharpness-Aware Policy Optimization (SHAPO) method

SHAPO is a novel policy update method that computes gradients at perturbed parameters to incorporate the actor's epistemic uncertainty. This approach makes policy updates pessimistic by evaluating the gradient at an adjusted parameter that minimizes expected return within a trust region, thereby promoting conservative behavior in under-explored regions.

Candidate papers retrieved: 5
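The update rule described above can be sketched numerically. This is a minimal illustration, not the paper's implementation: it assumes a plain Euclidean trust region (the Fisher-metric variant the paper reinterprets would change the norm), and `sharpness_aware_update`, `grad_fn`, and the toy concave objective are hypothetical stand-ins for a real policy-gradient estimator.

```python
import numpy as np

def sharpness_aware_update(theta, grad_fn, lr=0.05, rho=0.01):
    """One SAM-style pessimistic policy update (sketch).

    grad_fn(theta) returns the gradient of the expected return J
    with respect to the parameters. The perturbation steps against
    the return gradient, i.e. toward the worst parameters inside a
    Euclidean trust region of radius rho, and the ascent gradient
    is then evaluated at that pessimistic point.
    """
    g = grad_fn(theta)
    # Worst-case (return-minimizing) perturbation on the rho-ball.
    eps = -rho * g / (np.linalg.norm(g) + 1e-12)
    # Gradient ascent on expected return, using the pessimistic gradient.
    g_pess = grad_fn(theta + eps)
    return theta + lr * g_pess

# Toy concave stand-in for expected return: J(theta) = -||theta - 1||^2,
# so grad J = -2 * (theta - 1) and the optimum is at theta = 1.
grad_fn = lambda th: -2.0 * (th - np.ones_like(th))
theta = np.zeros(3)
for _ in range(500):
    theta = sharpness_aware_update(theta, grad_fn)
```

On this toy objective the iterates settle slightly past the optimum by an offset on the order of `rho`, reflecting that the update is driven by the gradient at the pessimistically shifted point rather than at the current parameters.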
Reinterpretation of Fisher-SAM as pessimism under epistemic uncertainty

The authors provide a theoretical reinterpretation showing that the sharpness-aware parameter perturbation corresponds to optimizing under epistemic uncertainty about policy parameters. They demonstrate that the adjusted parameter can be viewed as the most likely parameter falling in the lower tail of the uncertainty distribution, thereby formalizing the pessimistic bias.

Candidate papers retrieved: 0
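A hedged formal sketch of this reinterpretation follows. The Fisher matrix F(θ), radius ρ, scale σ², and tail offset δ are assumed notation reconstructed from the description above, not the paper's exact statement.

```latex
% Worst-case perturbation within a Fisher-metric trust region
% (first-order approximation):
\epsilon^{*}
  = \arg\min_{\|\epsilon\|_{F(\theta)} \le \rho} J(\theta + \epsilon)
  \approx -\rho\,
    \frac{F(\theta)^{-1} \nabla_{\theta} J(\theta)}
         {\sqrt{\nabla_{\theta} J(\theta)^{\top} F(\theta)^{-1}
                \nabla_{\theta} J(\theta)}}.
% Under a Gaussian model of epistemic uncertainty over the parameters,
%   \tilde{\theta} \sim \mathcal{N}\!\left(\theta,\, \sigma^{2} F(\theta)^{-1}\right),
% the adjusted parameter is the most likely parameter in the lower tail:
\theta + \epsilon^{*}
  = \arg\max_{\tilde{\theta} \,:\, J(\tilde{\theta}) \le J(\theta) - \delta}
    p(\tilde{\theta}),
% so evaluating the update gradient at \theta + \epsilon^{*} is a
% pessimistic (lower-tail) choice under the actor's uncertainty.
```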
Analytical characterization of gradient reweighting for rare actions

Through analysis in a simplified Gaussian policy setting, the authors show that SHAPO's gradient modification assigns greater weight to rare unsafe actions (negative advantage) while downweighting rare safe actions (positive advantage). This reweighting mechanism explains how SHAPO promotes safe exploration by treating unsafe rare events more seriously during policy updates.

Candidate papers retrieved: 8
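A first-order sketch of this reweighting view follows. The weights w(a) are a schematic stand-in for the curvature-induced factors the analysis derives, and their stated qualitative behavior is taken from the description above rather than re-derived here.

```latex
% Pessimistic gradient, expanded to first order in the perturbation:
\nabla_{\theta} J(\theta + \epsilon)
  \approx \nabla_{\theta} J(\theta) + \nabla_{\theta}^{2} J(\theta)\,\epsilon
  = \mathbb{E}_{a \sim \pi_{\theta}}\!\left[
      w(a)\, A(a)\, \nabla_{\theta} \log \pi_{\theta}(a)
    \right],
% where the curvature term acts as a per-action weight w(a). Per the
% analysis in the Gaussian-policy setting: w(a) > 1 for rare actions
% with A(a) < 0 (rare unsafe actions are amplified), and w(a) < 1 for
% rare actions with A(a) > 0 (rare safe actions are downweighted).
```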

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution: Sharpness-Aware Policy Optimization (SHAPO) method

Contribution: Reinterpretation of Fisher-SAM as pessimism under epistemic uncertainty

Contribution: Analytical characterization of gradient reweighting for rare actions