Breaking the Safety Paradox with Feasible Dual Policy Iteration

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: safety paradox, safe reinforcement learning, feasible policy iteration, feasibility function
Abstract:

Achieving zero constraint violations in safe reinforcement learning poses a significant challenge. We discover a key obstacle called the safety paradox: improving policy safety reduces the frequency of constraint-violating samples, thereby impairing feasibility function estimation and ultimately undermining policy safety. We theoretically prove that the estimation error bound of the feasibility function increases as the proportion of violating samples decreases. To overcome the safety paradox, we propose an algorithm called feasible dual policy iteration (FDPI), which employs an additional policy that strategically maximizes constraint violations while staying close to the original policy. Samples from both policies are combined for training, with the data distribution corrected by importance sampling. Extensive experiments show FDPI's state-of-the-art performance on the Safety-Gymnasium benchmark, where it simultaneously achieves the lowest violations and competitive-to-best returns.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces the safety paradox—a phenomenon where improving policy safety reduces constraint-violating samples, thereby degrading feasibility function estimation—and proposes feasible dual policy iteration (FDPI) to address it. Within the taxonomy, this work resides in the Feasible Policy Iteration Frameworks leaf under Feasibility-Guided and Recovery-Based Policies. This leaf contains only three papers total, including the original work and two siblings (Feasible Policy Iteration and Safe Exploration Iteration). The sparse population suggests this is a relatively focused research direction rather than a crowded subfield, with FDPI representing one of the few explicit dual-policy approaches to feasibility-guided learning.

The taxonomy reveals that neighboring leaves employ distinct safety mechanisms: Recovery and Safety Editor Policies use separate recovery behaviors, while Offline Safe RL with Feasibility Guidance operates without online violations. The broader Feasibility-Guided branch sits alongside Safety Function and Barrier-Based Methods (which synthesize control barriers) and Primal-Dual and Lagrangian Optimization Methods (which adjust penalty weights). FDPI's dual-policy structure diverges from single-policy feasibility iteration and from primal-dual penalty tuning, instead maintaining two policies to balance exploration of constraint boundaries with safety preservation. This positions the work at the intersection of feasibility guidance and dual-policy architectures.

Among the three contributions analyzed, the safety paradox discovery and theoretical analysis examined five candidates with zero refutations, suggesting this framing is relatively novel within the limited search scope. The FDPI algorithm itself examined ten candidates with no refutations, indicating the dual-policy feasibility iteration structure appears distinct among the papers reviewed. However, the importance sampling scheme for distribution correction examined ten candidates and found one refutable match, suggesting this technical component has more substantial prior work. These statistics reflect a search of twenty-five total candidates, not an exhaustive literature review, so the novelty assessment is bounded by this scope.

Overall, the work appears to introduce a fresh perspective on feasibility-guided safe RL by identifying the safety paradox and proposing a dual-policy solution. The sparse taxonomy leaf and low refutation rates across most contributions suggest meaningful novelty within the examined candidate set. However, the importance sampling component shows clearer overlap with prior techniques, and the limited search scope (twenty-five candidates) means the analysis cannot rule out additional related work in the broader literature. The contribution feels most novel in its problem framing and dual-policy architecture rather than in individual technical mechanisms.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 25
Refutable Paper: 1

Research Landscape Overview

Core task: Achieving zero constraint violations in safe reinforcement learning. The field is organized around several complementary strategies for ensuring that learned policies never violate safety constraints during training or deployment. Primal-Dual and Lagrangian Optimization Methods (e.g., Zero Violation Primal-Dual[1]) adjust penalty weights to balance reward maximization with constraint satisfaction. Safety Function and Barrier-Based Methods leverage control barrier functions and safety critics to certify safe actions in real time. Feasibility-Guided and Recovery-Based Policies maintain feasibility by iteratively refining policies within safe regions (Feasible Policy Iteration[11]) or by learning explicit recovery behaviors when approaching constraint boundaries. Intervention and Projection-Based Safety Mechanisms use runtime filters or shields (Shielding[47]) to override unsafe actions, while Adaptive and Versatile Constraint Handling approaches (Adaptive Safe RL[9], Constraint-Conditioned Policy[7]) adjust to varying or non-Markovian constraints. Model-Based and Robust Safety Certifiers provide formal guarantees under uncertainty, Exploration with Safety Guarantees ensures safe data collection, and Domain-Specific Safe RL Applications demonstrate these ideas in robotics, autonomous driving, and energy systems.

A particularly active line of work focuses on feasibility-guided frameworks that enforce zero violations by construction. Feasible Dual Policy[0] sits squarely within this branch, emphasizing dual-policy structures that maintain constraint admissibility throughout learning. This contrasts with primal-dual methods like Zero Violation Primal-Dual[1] and Zero Violation Safe Policy[2], which rely on penalty tuning rather than explicit feasibility iteration. Nearby works such as Feasible Policy Iteration[11] and Safe Exploration Iteration[50] share the emphasis on iterative refinement within safe sets, while Reducing Safety Interventions[3] explores how to minimize the need for external corrections. The central trade-off across these branches is between the computational overhead of maintaining strict feasibility and the flexibility of softer constraint handling, with Feasible Dual Policy[0] contributing a dual-policy architecture that aims to balance both concerns within the feasibility-guided paradigm.

Claimed Contributions

Discovery and theoretical analysis of the safety paradox

The authors identify and theoretically analyze a fundamental obstacle in safe reinforcement learning where improving policy safety paradoxically increases the estimation error bound of feasibility functions by reducing constraint-violating samples, creating a self-defeating cycle that prevents achieving zero violations.

5 retrieved papers
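The statistical intuition behind the paradox can be illustrated with a minimal Monte Carlo sketch (not the paper's proof; the function name and sample budget are hypothetical): estimating a violation probability from rollouts becomes relatively harder, at a fixed sample budget, as the policy gets safer and violating samples become rare.

```python
import random

def estimate_violation_rate(true_rate, n_samples, n_trials=2000, seed=0):
    """Monte Carlo estimate of a feasibility-style quantity: the
    probability that a rollout violates the constraint. Returns the
    mean relative absolute error of the estimator across trials."""
    rng = random.Random(seed)
    errors = []
    for _ in range(n_trials):
        violations = sum(rng.random() < true_rate for _ in range(n_samples))
        estimate = violations / n_samples
        errors.append(abs(estimate - true_rate) / true_rate)
    return sum(errors) / n_trials

# As the policy becomes safer (true violation rate shrinks), the same
# sample budget yields a worse relative estimate of the violation signal.
for rate in (0.5, 0.1, 0.01):
    err = estimate_violation_rate(rate, n_samples=200)
    print(f"violation rate {rate:>5}: mean relative error {err:.3f}")
```

This mirrors the claimed error-bound behavior: the relative error of a Bernoulli mean estimate scales roughly as sqrt((1-p)/(pn)), which grows without bound as the violation probability p shrinks.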
Feasible dual policy iteration (FDPI) algorithm

The authors propose FDPI, a novel algorithm that breaks the safety paradox by introducing a dual policy that deliberately maximizes constraint violations while maintaining proximity to the primal policy through KL divergence constraints, with data distribution corrected via importance sampling.

10 retrieved papers
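One way such a KL-regularized, violation-seeking dual policy could look in a discrete-action setting is sketched below. This is an illustrative reconstruction, not the paper's implementation: it uses the well-known closed-form solution of maximizing expected cost minus a KL penalty to the primal policy, and all names (`dual_policy`, `beta`) are hypothetical.

```python
import math

def dual_policy(primal_probs, costs, beta=1.0):
    """KL-regularized violation-seeking policy for one state.

    Solves  max_pi  E_pi[cost] - beta * KL(pi || pi_primal)
    in closed form:  pi(a) proportional to pi_primal(a) * exp(cost(a) / beta).
    Larger beta keeps the dual policy closer to the primal one.
    """
    weights = [p * math.exp(c / beta) for p, c in zip(primal_probs, costs)]
    z = sum(weights)
    return [w / z for w in weights]

primal = [0.7, 0.2, 0.1]   # a fairly safe primal policy over 3 actions
costs = [0.0, 0.5, 2.0]    # expected constraint violation per action
print(dual_policy(primal, costs, beta=1.0))  # shifts mass toward costly actions
```

The KL anchor is what distinguishes this from naive adversarial exploration: the dual policy seeks out constraint boundaries, but only in regions the primal policy already visits with non-negligible probability.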
Importance sampling scheme for distribution correction

The authors develop an importance sampling method to correct distributional shifts when combining data from both primal and dual policies, approximating marginal state distributions with truncated trajectory distributions and introducing KL divergence constraints to ensure numerical stability.

10 retrieved papers
Can Refute
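A minimal sketch of the kind of correction described above is given below, assuming per-step log-probabilities under each policy are available. The truncation horizon and clipping threshold are hypothetical choices for illustration; the paper's exact scheme may differ.

```python
import math

def truncated_is_weight(logp_primal, logp_dual, horizon=8, clip=10.0):
    """Importance weight for reweighting dual-policy data toward the
    primal policy's distribution.

    The product of per-step probability ratios is truncated at `horizon`
    steps (approximating the marginal state distribution with a truncated
    trajectory distribution) and clipped for numerical stability.
    """
    log_w = sum(lp - ld for lp, ld in zip(logp_primal[:horizon], logp_dual[:horizon]))
    return min(math.exp(log_w), clip)

# When the two policies agree on the truncated prefix, the weight is 1;
# large divergences are capped at `clip` instead of exploding.
print(truncated_is_weight([-1.2, -0.5], [-1.2, -0.5]))  # 1.0
```

Truncation and clipping trade a controlled bias for bounded variance, which is why a KL constraint between the two policies (keeping per-step ratios near 1) is a natural companion to this estimator.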

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Discovery and theoretical analysis of the safety paradox

The authors identify and theoretically analyze a fundamental obstacle in safe reinforcement learning where improving policy safety paradoxically increases the estimation error bound of feasibility functions by reducing constraint-violating samples, creating a self-defeating cycle that prevents achieving zero violations.

Contribution

Feasible dual policy iteration (FDPI) algorithm

The authors propose FDPI, a novel algorithm that breaks the safety paradox by introducing a dual policy that deliberately maximizes constraint violations while maintaining proximity to the primal policy through KL divergence constraints, with data distribution corrected via importance sampling.

Contribution

Importance sampling scheme for distribution correction

The authors develop an importance sampling method to correct distributional shifts when combining data from both primal and dual policies, approximating marginal state distributions with truncated trajectory distributions and introducing KL divergence constraints to ensure numerical stability.