SHAPO: Sharpness-Aware Policy Optimization for Safe Exploration

ICLR 2026 Conference Submission, Anonymous Authors
Keywords: Safe Exploration, Sharpness-Aware Minimization, Epistemic Uncertainty
Abstract:

Safe exploration is a prerequisite for deploying reinforcement learning (RL) agents in safety-critical domains. In this paper, we approach safe exploration through the lens of epistemic uncertainty, where the actor's sensitivity to parameter perturbations serves as a practical proxy for regions of high uncertainty. We propose Sharpness-Aware Policy Optimization (SHAPO), a sharpness-aware policy update rule that evaluates gradients at perturbed parameters, making policy updates pessimistic with respect to the actor's epistemic uncertainty. Analytically, we show that this adjustment implicitly reweights policy gradients, amplifying the influence of rare unsafe actions while tempering contributions from already safe ones, thereby biasing learning toward conservative behavior in under-explored regions. Across several continuous-control tasks, our method consistently improves both safety and task performance over existing baselines, significantly expanding their Pareto frontiers.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes Sharpness-Aware Policy Optimization (SHAPO), which uses parameter perturbations as a proxy for epistemic uncertainty to guide safe exploration. It resides in the 'Policy Optimization with Safety Constraints' leaf, which contains four papers including the original work. This leaf sits within the broader 'Algorithm Design for Safe Exploration' branch, indicating a moderately populated research direction focused on integrating safety directly into policy gradient updates rather than using post-hoc corrections or model-based planning.

The taxonomy reveals neighboring approaches that handle safety through different mechanisms. The 'Safety Layer and Action Correction Methods' leaf (three papers) applies analytical filters after policy decisions, while 'Model-Based Safe RL' (two papers) leverages dynamics models for predictive safety. The 'Uncertainty Quantification and Epistemic Safety' branch addresses similar concerns about unknown risks but through conservative critics and probabilistic constraints rather than sharpness-aware updates. SHAPO's epistemic uncertainty framing connects conceptually to this branch, though it operationalizes uncertainty differently via parameter sensitivity rather than explicit distributional modeling.

Among thirteen candidates examined across three contributions, none were identified as clearly refuting the work. The core SHAPO method examined five candidates with zero refutations, while the analytical gradient reweighting characterization examined eight candidates, also with zero refutations. The reinterpretation of Fisher-SAM as epistemic pessimism examined no candidates. This limited search scope—thirteen papers from semantic retrieval—suggests the analysis captures closely related policy optimization methods but may not cover the full breadth of uncertainty-driven safe exploration approaches or sharpness-aware techniques from adjacent fields.

Given the search scale, the work appears to occupy a relatively distinct position within policy optimization methods, combining sharpness awareness with safety constraints in a novel way. However, the analysis does not exhaustively cover connections to broader sharpness-aware learning literature or alternative epistemic uncertainty quantification methods outside the top-thirteen semantic matches. The taxonomy structure suggests this is an active but not overcrowded research direction, with room for differentiation among the four sibling papers in the same leaf.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 13
Refutable Papers: 0

Research Landscape Overview

Core task: safe exploration in reinforcement learning under constraints. The field addresses how agents can learn effective policies while respecting safety requirements throughout the learning process. The taxonomy reveals a rich structure organized around several complementary perspectives. Constraint Formulation and Problem Frameworks establish the mathematical foundations, defining how safety requirements are encoded, ranging from hard constraints in Safety Constrained MDPs[13] to probabilistic formulations like Hierarchical Chance Constraints[4].

Algorithm Design for Safe Exploration encompasses policy optimization methods that directly integrate safety into learning, while Uncertainty Quantification and Epistemic Safety focuses on managing unknown risks through conservative estimation, as seen in Conservative Safety Critics[8] and Epistemic Uncertainty[46]. Geometric perspectives emerge in approaches like Constraint Manifold[7] and Robot Constraint Manifold[16], which exploit the structure of feasible state spaces.

The taxonomy also distinguishes between Statewise and Instantaneous Safety (ensuring safety at every step versus over trajectories), includes Multi-Agent and Hierarchical settings, and recognizes Interactive and Human-in-the-Loop methods such as Natural Language Constraints[6]. Application Domains and Exploration Strategy Design round out the landscape, connecting theory to practice.

Several active research directions reveal key trade-offs between exploration efficiency and safety guarantees. Works emphasizing provable safety, such as Provable Guarantees[17] and Zero Constraint Violation[38], contrast with methods that balance constraint satisfaction with learning speed, such as Recovery RL[3], which allows temporary violations with recovery mechanisms.
SHAPO[0] sits within the Policy Optimization with Safety Constraints branch alongside Constrained PPO[11] and Feasible Actor Critic[30], sharing their focus on integrating constraint handling directly into policy gradient methods. Compared to Multi Objective Safety[14], which frames safety as one objective among many, SHAPO emphasizes constraint satisfaction as a hard requirement rather than a preference to be traded off. The positioning reflects a broader tension in the field: whether to pursue conservative approaches that guarantee safety from the outset or to enable more aggressive exploration with corrective mechanisms, a question that remains central as methods scale to complex real-world domains.

Claimed Contributions

Sharpness-Aware Policy Optimization (SHAPO) method

SHAPO is a novel policy update method that computes gradients at perturbed parameters to incorporate the actor's epistemic uncertainty. This approach makes policy updates pessimistic by evaluating the gradient at an adjusted parameter that minimizes expected return within a trust region, thereby promoting conservative behavior in under-explored regions.

Candidate papers retrieved: 5
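The update rule described above can be sketched numerically. This is a minimal illustration, not the paper's implementation: it assumes a plain Euclidean trust region (the Fisher-metric variant the paper reinterprets would change the norm), and `sharpness_aware_update`, `grad_fn`, and the toy concave objective are hypothetical stand-ins for a real policy-gradient estimator.

```python
import numpy as np

def sharpness_aware_update(theta, grad_fn, lr=0.05, rho=0.01):
    """One SAM-style pessimistic policy update (sketch).

    grad_fn(theta) returns the gradient of the expected return J
    with respect to the parameters. The perturbation steps against
    the return gradient, i.e. toward the worst parameters inside a
    Euclidean trust region of radius rho, and the ascent gradient
    is then evaluated at that pessimistic point.
    """
    g = grad_fn(theta)
    # Worst-case (return-minimizing) perturbation on the rho-ball.
    eps = -rho * g / (np.linalg.norm(g) + 1e-12)
    # Gradient ascent on expected return, using the pessimistic gradient.
    g_pess = grad_fn(theta + eps)
    return theta + lr * g_pess

# Toy concave stand-in for expected return: J(theta) = -||theta - 1||^2,
# so grad J = -2 * (theta - 1) and the optimum is at theta = 1.
grad_fn = lambda th: -2.0 * (th - np.ones_like(th))
theta = np.zeros(3)
for _ in range(500):
    theta = sharpness_aware_update(theta, grad_fn)
```

On this toy objective the iterates settle slightly past the optimum by an offset on the order of `rho`, reflecting that the update is driven by the gradient at the pessimistically shifted point rather than at the current parameters.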
Reinterpretation of Fisher-SAM as pessimism under epistemic uncertainty

The authors provide a theoretical reinterpretation showing that the sharpness-aware parameter perturbation corresponds to optimizing under epistemic uncertainty about policy parameters. They demonstrate that the adjusted parameter can be viewed as the most likely parameter falling in the lower tail of the uncertainty distribution, thereby formalizing the pessimistic bias.

Candidate papers retrieved: 0
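A hedged formal sketch of this reinterpretation follows. The Fisher matrix F(θ), radius ρ, scale σ², and tail offset δ are assumed notation reconstructed from the description above, not the paper's exact statement.

```latex
% Worst-case perturbation within a Fisher-metric trust region
% (first-order approximation):
\epsilon^{*}
  = \arg\min_{\|\epsilon\|_{F(\theta)} \le \rho} J(\theta + \epsilon)
  \approx -\rho\,
    \frac{F(\theta)^{-1} \nabla_{\theta} J(\theta)}
         {\sqrt{\nabla_{\theta} J(\theta)^{\top} F(\theta)^{-1}
                \nabla_{\theta} J(\theta)}}.
% Under a Gaussian model of epistemic uncertainty over the parameters,
%   \tilde{\theta} \sim \mathcal{N}\!\left(\theta,\, \sigma^{2} F(\theta)^{-1}\right),
% the adjusted parameter is the most likely parameter in the lower tail:
\theta + \epsilon^{*}
  = \arg\max_{\tilde{\theta} \,:\, J(\tilde{\theta}) \le J(\theta) - \delta}
    p(\tilde{\theta}),
% so evaluating the update gradient at \theta + \epsilon^{*} is a
% pessimistic (lower-tail) choice under the actor's uncertainty.
```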
Analytical characterization of gradient reweighting for rare actions

Through analysis in a simplified Gaussian policy setting, the authors show that SHAPO's gradient modification assigns greater weight to rare unsafe actions (negative advantage) while downweighting rare safe actions (positive advantage). This reweighting mechanism explains how SHAPO promotes safe exploration by treating unsafe rare events more seriously during policy updates.

Candidate papers retrieved: 8
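A first-order sketch of this reweighting view follows. The weights w(a) are a schematic stand-in for the curvature-induced factors the analysis derives, and their stated qualitative behavior is taken from the description above rather than re-derived here.

```latex
% Pessimistic gradient, expanded to first order in the perturbation:
\nabla_{\theta} J(\theta + \epsilon)
  \approx \nabla_{\theta} J(\theta) + \nabla_{\theta}^{2} J(\theta)\,\epsilon
  = \mathbb{E}_{a \sim \pi_{\theta}}\!\left[
      w(a)\, A(a)\, \nabla_{\theta} \log \pi_{\theta}(a)
    \right],
% where the curvature term acts as a per-action weight w(a). Per the
% analysis in the Gaussian-policy setting: w(a) > 1 for rare actions
% with A(a) < 0 (rare unsafe actions are amplified), and w(a) < 1 for
% rare actions with A(a) > 0 (rare safe actions are downweighted).
```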

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution: Sharpness-Aware Policy Optimization (SHAPO) method

Contribution: Reinterpretation of Fisher-SAM as pessimism under epistemic uncertainty

Contribution: Analytical characterization of gradient reweighting for rare actions