Safe Exploration via Policy Priors

ICLR 2026 Conference Submission · Anonymous Authors
Keywords: Deep Reinforcement Learning, Safe Exploration, Safe RL, Constrained Markov Decision Processes
Abstract:

Safe exploration is a key requirement for reinforcement learning agents to learn and adapt online, beyond controlled (e.g., simulated) environments. In this work, we tackle this challenge by utilizing suboptimal yet conservative policies (e.g., obtained from offline data or simulators) as priors. Our approach, SOOPER, uses probabilistic dynamics models to explore optimistically, yet fall back pessimistically to the conservative policy prior when needed. We prove that SOOPER guarantees safety throughout learning, and establish convergence to an optimal policy by bounding its cumulative regret. Extensive experiments on key safe RL benchmarks and real-world hardware demonstrate that SOOPER is scalable and outperforms the state of the art, and validate our theoretical guarantees in practice.

Disclaimer
This report is AI-generated using Large Language Models and WisPaper (a scholarly search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes SOOPER, an algorithm that uses probabilistic dynamics models to balance optimistic exploration with pessimistic fallback to conservative policy priors, ensuring safety throughout learning. It resides in the 'Optimistic-Pessimistic Exploration with Policy Priors' leaf, which contains only two papers total (including this one). This indicates a relatively sparse research direction within the broader safe RL landscape, suggesting the specific combination of optimistic-pessimistic balancing with explicit prior fallback mechanisms remains underexplored compared to adjacent areas like constraint-based methods or risk-aware techniques.

The taxonomy reveals that SOOPER's leaf sits within 'Exploration Strategies with Safety Guarantees', a branch containing four leaves and thirteen papers total. Neighboring leaves include 'Risk-Aware and Uncertainty-Based Methods' (three papers focusing on probabilistic safety without explicit prior fallback) and 'Adaptive Safe Action Selection' (two papers emphasizing step-by-step action filtering). The sibling paper in SOOPER's leaf addresses similar optimistic-pessimistic trade-offs but may differ in implementation details or theoretical frameworks. The taxonomy's scope notes clarify that methods without explicit prior fallback mechanisms belong to adjacent categories, positioning SOOPER at a specific intersection of prior-guided learning and exploration guarantees.

Among the twenty-four candidates examined, the contribution-level analysis reveals mixed novelty signals. For the core SOOPER algorithm (Contribution 1), four candidates were examined with zero refutations, suggesting limited direct overlap in algorithmic design. For the theoretical guarantees (Contribution 2), however, ten candidates were examined and three potential refutations were found, indicating that safety and regret bounds in this setting may have substantial prior work. For the empirical validation (Contribution 3), ten candidates were examined with no refutations, though this likely reflects differences in experimental domains rather than fundamental novelty. The limited search scope means these findings capture only the top semantic matches, not exhaustive coverage.

Given the sparse taxonomy leaf and limited search scale, SOOPER appears to occupy a relatively underexplored niche combining policy priors with optimistic-pessimistic exploration. The algorithmic contribution shows stronger novelty signals than the theoretical guarantees, where prior work on safety and regret analysis appears more developed. The analysis is constrained by examining only twenty-four candidates from semantic search, leaving open the possibility of additional relevant work outside this scope.

Taxonomy

Core-task Taxonomy Papers: 35
Claimed Contributions: 3
Contribution Candidate Papers Compared: 24
Refutable Papers: 3

Research Landscape Overview

Core task: safe exploration in reinforcement learning with policy priors. The field addresses how agents can learn effectively while respecting safety constraints by leveraging prior knowledge or baseline policies. The taxonomy reveals four main branches that capture complementary perspectives on this challenge. Policy Prior Integration Mechanisms examines architectural and algorithmic strategies for combining learned policies with existing controllers or expert demonstrations, including methods like controller fusion and residual policy learning. Constraint-Based Safe Learning focuses on formulating and enforcing explicit safety constraints during training, often through optimization frameworks that balance performance objectives with hard or soft safety requirements. Exploration Strategies with Safety Guarantees develops principled approaches to guide exploration toward informative yet safe regions of the state-action space, sometimes employing optimistic-pessimistic trade-offs or probabilistic safety certificates. Domain-Specific Safe RL Applications translates these ideas into concrete settings such as robotics, autonomous driving, and power systems, where real-world consequences demand reliable safety assurances.

Within the exploration strategies branch, a particularly active line of work investigates how to balance optimism for learning with pessimism for safety, often using policy priors as anchors to prevent catastrophic failures. Safe Exploration Policy Priors[0] sits squarely in this optimistic-pessimistic exploration cluster, sharing thematic connections with Safety-Efficiency Balanced Navigation[26], which similarly navigates the tension between task performance and constraint satisfaction.
Nearby efforts like Safe Interactive Learning[3] and Efficient Safe RL Sampling[7] emphasize sample-efficient exploration under safety constraints, while works such as Bayesian Controller Fusion[2] and Adaptive Regularization Safe Control[8] blend prior knowledge with adaptive learning mechanisms. The original paper's emphasis on policy priors as a foundation for safe exploration distinguishes it from purely constraint-driven approaches, positioning it as a bridge between integration mechanisms and exploration guarantees that leverages existing knowledge to guide safe discovery of improved behaviors.

Claimed Contributions

SOOPER algorithm for safe exploration using policy priors

The authors introduce SOOPER, a model-based reinforcement learning algorithm that uses suboptimal yet conservative policies (obtained from offline data or simulators) as priors. The algorithm employs probabilistic dynamics models to explore optimistically while pessimistically falling back to the conservative policy prior when needed to maintain safety.

4 retrieved papers
Theoretical guarantees for safety and cumulative regret bound

The authors provide theoretical analysis proving that SOOPER maintains safety throughout learning with high probability, and establish a novel bound on its cumulative regret. This improves over prior works that only guarantee optimality at the end of training, by ensuring good performance during exploration as well.

10 retrieved papers
Can Refute
Empirical validation on benchmarks and real-world hardware

The authors perform extensive experiments demonstrating that SOOPER outperforms state-of-the-art baselines on standard safe RL benchmarks and validate the approach on real-world robotic hardware, providing empirical evidence that their theoretical guarantees translate to practice.

10 retrieved papers

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

SOOPER algorithm for safe exploration using policy priors

The authors introduce SOOPER, a model-based reinforcement learning algorithm that uses suboptimal yet conservative policies (obtained from offline data or simulators) as priors. The algorithm employs probabilistic dynamics models to explore optimistically while pessimistically falling back to the conservative policy prior when needed to maintain safety.
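The optimistic-pessimistic fallback scheme described above can be sketched in pseudocode. This is a minimal illustration under assumed interfaces (an ensemble whose members each predict a reward and a cost for a state-action pair, and a generic `prior_policy`); the names and the decision rule shown here are assumptions for illustration, not the authors' implementation:

```python
def sooper_action(state, candidate_actions, ensemble, prior_policy, cost_budget):
    """Pick an action: optimistic w.r.t. reward, pessimistic w.r.t. cost.

    `ensemble` is a list of learned models; each maps (state, action) to a
    (predicted_reward, predicted_cost) pair. Disagreement across members
    stands in for epistemic uncertainty from the probabilistic dynamics model.
    """
    best_action, best_optimistic_reward = None, float("-inf")
    for action in candidate_actions:
        predictions = [model(state, action) for model in ensemble]
        # Optimism for exploration: best-case reward across the ensemble.
        optimistic_reward = max(r for r, _ in predictions)
        # Pessimism for safety: worst-case cost across the ensemble.
        pessimistic_cost = max(c for _, c in predictions)
        if pessimistic_cost <= cost_budget and optimistic_reward > best_optimistic_reward:
            best_action, best_optimistic_reward = action, optimistic_reward
    # If no candidate is certifiably safe under the pessimistic estimate,
    # fall back to the conservative policy prior.
    return best_action if best_action is not None else prior_policy(state)
```

A toy ensemble of two reward/cost predictors suffices to exercise both branches: when some candidate action passes the pessimistic cost check, the most optimistic one is chosen; when none does, the prior's action is returned.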

Contribution

Theoretical guarantees for safety and cumulative regret bound

The authors provide theoretical analysis proving that SOOPER maintains safety throughout learning with high probability, and establish a novel bound on its cumulative regret. This improves over prior works that only guarantee optimality at the end of training, by ensuring good performance during exploration as well.
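In generic constrained-MDP notation (the symbols below are standard conventions, not necessarily the paper's own), the two guarantees can be stated roughly as follows, with $J_r(\pi)$ and $J_c(\pi)$ the expected return and expected cost of policy $\pi$, $d$ the cost budget, and $\pi_n$ the policy deployed in episode $n$:

```latex
% Safety throughout learning: with probability at least 1 - \delta,
% every deployed policy satisfies the cost constraint
\Pr\left[\, J_c(\pi_n) \le d \quad \forall\, n = 1, \dots, N \,\right] \ge 1 - \delta
% Sublinear cumulative regret against the optimal safe policy \pi^*:
R_N \;=\; \sum_{n=1}^{N} \bigl( J_r(\pi^*) - J_r(\pi_n) \bigr) \;=\; o(N)
```

Sublinear regret implies the average per-episode suboptimality $R_N / N$ vanishes, which is what distinguishes "good performance during exploration" from guarantees that hold only for the final policy.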

Contribution

Empirical validation on benchmarks and real-world hardware

The authors perform extensive experiments demonstrating that SOOPER outperforms state-of-the-art baselines on standard safe RL benchmarks and validate the approach on real-world robotic hardware, providing empirical evidence that their theoretical guarantees translate to practice.