Safe Exploration via Policy Priors

ICLR 2026 Conference Submission · Anonymous Authors
Keywords: Deep Reinforcement Learning, Safe Exploration, Safe RL, Constrained Markov Decision Processes
Abstract:

Safe exploration is a key requirement for reinforcement learning agents to learn and adapt online, beyond controlled (e.g., simulated) environments. In this work, we tackle this challenge by utilizing suboptimal yet conservative policies (e.g., obtained from offline data or simulators) as priors. Our approach, SOOPER, uses probabilistic dynamics models to explore optimistically, yet fall back pessimistically to the conservative policy prior when needed. We prove that SOOPER guarantees safety throughout learning, and establish convergence to an optimal policy by bounding its cumulative regret. Extensive experiments on key safe RL benchmarks and real-world hardware demonstrate that SOOPER is scalable and outperforms the state of the art, and validate our theoretical guarantees in practice.

Disclaimer
This report is AI-generated using Large Language Models and WisPaper (a scholarly search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes SOOPER, an algorithm that uses probabilistic dynamics models to balance optimistic exploration with pessimistic fallback to conservative policy priors, ensuring safety throughout learning. It resides in the 'Optimistic-Pessimistic Exploration with Policy Priors' leaf, which contains only two papers total (including this one). This indicates a relatively sparse research direction within the broader safe RL landscape, suggesting the specific combination of optimistic-pessimistic balancing with explicit prior fallback mechanisms remains underexplored compared to adjacent areas like constraint-based methods or risk-aware techniques.

The taxonomy reveals that SOOPER's leaf sits within 'Exploration Strategies with Safety Guarantees', a branch containing four leaves and thirteen papers total. Neighboring leaves include 'Risk-Aware and Uncertainty-Based Methods' (three papers focusing on probabilistic safety without explicit prior fallback) and 'Adaptive Safe Action Selection' (two papers emphasizing step-by-step action filtering). The sibling paper in SOOPER's leaf addresses similar optimistic-pessimistic trade-offs but may differ in implementation details or theoretical frameworks. The taxonomy's scope notes clarify that methods without explicit prior fallback mechanisms belong to adjacent categories, positioning SOOPER at a specific intersection of prior-guided learning and exploration guarantees.

Among the twenty-four candidates examined, the contribution-level analysis reveals mixed novelty signals. For the core SOOPER algorithm (Contribution 1), four candidates were examined with zero refutations, suggesting limited direct overlap in algorithmic design. For the theoretical guarantees (Contribution 2), however, ten candidates were examined and three potential refutations were found, indicating that safety and regret bounds in this setting may have substantial prior work. For the empirical validation (Contribution 3), ten candidates were examined with no refutations, though this likely reflects differences in experimental domains rather than fundamental novelty. The limited search scope means these findings capture only the top semantic matches, not exhaustive coverage.

Given the sparse taxonomy leaf and limited search scale, SOOPER appears to occupy a relatively underexplored niche combining policy priors with optimistic-pessimistic exploration. The algorithmic contribution shows stronger novelty signals than the theoretical guarantees, where prior work on safety and regret analysis appears more developed. The analysis is constrained by examining only twenty-four candidates from semantic search, leaving open the possibility of additional relevant work outside this scope.

Taxonomy

Core-task Taxonomy Papers: 35
Claimed Contributions: 3
Contribution Candidate Papers Compared: 24
Refutable Papers: 3

Research Landscape Overview

Core task: safe exploration in reinforcement learning with policy priors. The field addresses how agents can learn effectively while respecting safety constraints by leveraging prior knowledge or baseline policies. The taxonomy reveals four main branches that capture complementary perspectives on this challenge. Policy Prior Integration Mechanisms examines architectural and algorithmic strategies for combining learned policies with existing controllers or expert demonstrations, including methods like controller fusion and residual policy learning. Constraint-Based Safe Learning focuses on formulating and enforcing explicit safety constraints during training, often through optimization frameworks that balance performance objectives with hard or soft safety requirements. Exploration Strategies with Safety Guarantees develops principled approaches to guide exploration toward informative yet safe regions of the state-action space, sometimes employing optimistic-pessimistic trade-offs or probabilistic safety certificates. Domain-Specific Safe RL Applications translates these ideas into concrete settings such as robotics, autonomous driving, and power systems, where real-world consequences demand reliable safety assurances.

Within the exploration strategies branch, a particularly active line of work investigates how to balance optimism for learning with pessimism for safety, often using policy priors as anchors to prevent catastrophic failures. Safe Exploration Policy Priors[0] sits squarely in this optimistic-pessimistic exploration cluster, sharing thematic connections with Safety-Efficiency Balanced Navigation[26], which similarly navigates the tension between task performance and constraint satisfaction.
Nearby efforts like Safe Interactive Learning[3] and Efficient Safe RL Sampling[7] emphasize sample-efficient exploration under safety constraints, while works such as Bayesian Controller Fusion[2] and Adaptive Regularization Safe Control[8] blend prior knowledge with adaptive learning mechanisms. The original paper's emphasis on policy priors as a foundation for safe exploration distinguishes it from purely constraint-driven approaches, positioning it as a bridge between integration mechanisms and exploration guarantees that leverages existing knowledge to guide safe discovery of improved behaviors.

Claimed Contributions

SOOPER algorithm for safe exploration using policy priors

The authors introduce SOOPER, a model-based reinforcement learning algorithm that uses suboptimal yet conservative policies (obtained from offline data or simulators) as priors. The algorithm employs probabilistic dynamics models to explore optimistically while pessimistically falling back to the conservative policy prior when needed to maintain safety.

4 retrieved papers
Theoretical guarantees for safety and cumulative regret bound

The authors provide theoretical analysis proving that SOOPER maintains safety throughout learning with high probability, and establish a novel bound on its cumulative regret. This improves over prior works that only guarantee optimality at the end of training, by ensuring good performance during exploration as well.

10 retrieved papers
Can Refute
Empirical validation on benchmarks and real-world hardware

The authors perform extensive experiments demonstrating that SOOPER outperforms state-of-the-art baselines on standard safe RL benchmarks and validate the approach on real-world robotic hardware, providing empirical evidence that their theoretical guarantees translate to practice.

10 retrieved papers

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

SOOPER algorithm for safe exploration using policy priors

The authors introduce SOOPER, a model-based reinforcement learning algorithm that uses suboptimal yet conservative policies (obtained from offline data or simulators) as priors. The algorithm employs probabilistic dynamics models to explore optimistically while pessimistically falling back to the conservative policy prior when needed to maintain safety.
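The optimistic-pessimistic fallback scheme described above can be sketched in pseudocode. This is a minimal illustration under assumed interfaces (an ensemble whose members each predict a reward and a cost for a state-action pair, and a generic `prior_policy`); the names and the decision rule shown here are assumptions for illustration, not the authors' implementation:

```python
def sooper_action(state, candidate_actions, ensemble, prior_policy, cost_budget):
    """Pick an action: optimistic w.r.t. reward, pessimistic w.r.t. cost.

    `ensemble` is a list of learned models; each maps (state, action) to a
    (predicted_reward, predicted_cost) pair. Disagreement across members
    stands in for epistemic uncertainty from the probabilistic dynamics model.
    """
    best_action, best_optimistic_reward = None, float("-inf")
    for action in candidate_actions:
        predictions = [model(state, action) for model in ensemble]
        # Optimism for exploration: best-case reward across the ensemble.
        optimistic_reward = max(r for r, _ in predictions)
        # Pessimism for safety: worst-case cost across the ensemble.
        pessimistic_cost = max(c for _, c in predictions)
        if pessimistic_cost <= cost_budget and optimistic_reward > best_optimistic_reward:
            best_action, best_optimistic_reward = action, optimistic_reward
    # If no candidate is certifiably safe under the pessimistic estimate,
    # fall back to the conservative policy prior.
    return best_action if best_action is not None else prior_policy(state)
```

A toy ensemble of two reward/cost predictors suffices to exercise both branches: when some candidate action passes the pessimistic cost check, the most optimistic one is chosen; when none does, the prior's action is returned.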

Contribution

Theoretical guarantees for safety and cumulative regret bound

The authors provide theoretical analysis proving that SOOPER maintains safety throughout learning with high probability, and establish a novel bound on its cumulative regret. This improves over prior works that only guarantee optimality at the end of training, by ensuring good performance during exploration as well.
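In generic constrained-MDP notation (the symbols below are standard conventions, not necessarily the paper's own), the two guarantees can be stated roughly as follows, with $J_r(\pi)$ and $J_c(\pi)$ the expected return and expected cost of policy $\pi$, $d$ the cost budget, and $\pi_n$ the policy deployed in episode $n$:

```latex
% Safety throughout learning: with probability at least 1 - \delta,
% every deployed policy satisfies the cost constraint
\Pr\left[\, J_c(\pi_n) \le d \quad \forall\, n = 1, \dots, N \,\right] \ge 1 - \delta
% Sublinear cumulative regret against the optimal safe policy \pi^*:
R_N \;=\; \sum_{n=1}^{N} \bigl( J_r(\pi^*) - J_r(\pi_n) \bigr) \;=\; o(N)
```

Sublinear regret implies the average per-episode suboptimality $R_N / N$ vanishes, which is what distinguishes "good performance during exploration" from guarantees that hold only for the final policy.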

Contribution

Empirical validation on benchmarks and real-world hardware

The authors perform extensive experiments demonstrating that SOOPER outperforms state-of-the-art baselines on standard safe RL benchmarks and validate the approach on real-world robotic hardware, providing empirical evidence that their theoretical guarantees translate to practice.