Breaking the Safety Paradox with Feasible Dual Policy Iteration

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: safety paradox, safe reinforcement learning, feasible policy iteration, feasibility function
Abstract:

Achieving zero constraint violations in safe reinforcement learning poses a significant challenge. We discover a key obstacle called the safety paradox: improving policy safety reduces the frequency of constraint-violating samples, thereby impairing feasibility function estimation and ultimately undermining policy safety. We theoretically prove that the estimation error bound of the feasibility function increases as the proportion of violating samples decreases. To overcome the safety paradox, we propose an algorithm called feasible dual policy iteration (FDPI), which employs an additional policy that strategically maximizes constraint violations while staying close to the original policy. Samples from both policies are combined for training, with the data distribution corrected by importance sampling. Extensive experiments show FDPI's state-of-the-art performance on the Safety-Gymnasium benchmark, where it simultaneously achieves the lowest violations and competitive-to-best returns.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces the safety paradox—a phenomenon where improving policy safety reduces constraint-violating samples, thereby degrading feasibility function estimation—and proposes feasible dual policy iteration (FDPI) to address it. Within the taxonomy, this work resides in the Feasible Policy Iteration Frameworks leaf under Feasibility-Guided and Recovery-Based Policies. This leaf contains only three papers total, including the original work and two siblings (Feasible Policy Iteration and Safe Exploration Iteration). The sparse population suggests this is a relatively focused research direction rather than a crowded subfield, with FDPI representing one of the few explicit dual-policy approaches to feasibility-guided learning.

The taxonomy reveals that neighboring leaves employ distinct safety mechanisms: Recovery and Safety Editor Policies use separate recovery behaviors, while Offline Safe RL with Feasibility Guidance operates without online violations. The broader Feasibility-Guided branch sits alongside Safety Function and Barrier-Based Methods (which synthesize control barriers) and Primal-Dual and Lagrangian Optimization Methods (which adjust penalty weights). FDPI's dual-policy structure diverges from single-policy feasibility iteration and from primal-dual penalty tuning, instead maintaining two policies to balance exploration of constraint boundaries with safety preservation. This positions the work at the intersection of feasibility guidance and dual-policy architectures.

Among the three contributions analyzed, the safety paradox discovery and theoretical analysis examined five candidates with zero refutations, suggesting this framing is relatively novel within the limited search scope. The FDPI algorithm itself examined ten candidates with no refutations, indicating the dual-policy feasibility iteration structure appears distinct among the papers reviewed. However, the importance sampling scheme for distribution correction examined ten candidates and found one refutable match, suggesting this technical component has more substantial prior work. These statistics reflect a search of twenty-five total candidates, not an exhaustive literature review, so the novelty assessment is bounded by this scope.

Overall, the work appears to introduce a fresh perspective on feasibility-guided safe RL by identifying the safety paradox and proposing a dual-policy solution. The sparse taxonomy leaf and low refutation rates across most contributions suggest meaningful novelty within the examined candidate set. However, the importance sampling component shows clearer overlap with prior techniques, and the limited search scope (twenty-five candidates) means the analysis cannot rule out additional related work in the broader literature. The contribution feels most novel in its problem framing and dual-policy architecture rather than in individual technical mechanisms.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 25
Refutable Paper: 1

Research Landscape Overview

Core task: Achieving zero constraint violations in safe reinforcement learning. The field is organized around several complementary strategies for ensuring that learned policies never violate safety constraints during training or deployment. Primal-Dual and Lagrangian Optimization Methods (e.g., Zero Violation Primal-Dual[1]) adjust penalty weights to balance reward maximization with constraint satisfaction. Safety Function and Barrier-Based Methods leverage control barrier functions and safety critics to certify safe actions in real time. Feasibility-Guided and Recovery-Based Policies maintain feasibility by iteratively refining policies within safe regions (Feasible Policy Iteration[11]) or by learning explicit recovery behaviors when approaching constraint boundaries. Intervention and Projection-Based Safety Mechanisms use runtime filters or shields (Shielding[47]) to override unsafe actions, while Adaptive and Versatile Constraint Handling approaches (Adaptive Safe RL[9], Constraint-Conditioned Policy[7]) adjust to varying or non-Markovian constraints. Model-Based and Robust Safety Certifiers provide formal guarantees under uncertainty, Exploration with Safety Guarantees ensures safe data collection, and Domain-Specific Safe RL Applications demonstrate these ideas in robotics, autonomous driving, and energy systems.

A particularly active line of work focuses on feasibility-guided frameworks that enforce zero violations by construction. Feasible Dual Policy[0] sits squarely within this branch, emphasizing dual-policy structures that maintain constraint admissibility throughout learning. This contrasts with primal-dual methods like Zero Violation Primal-Dual[1] and Zero Violation Safe Policy[2], which rely on penalty tuning rather than explicit feasibility iteration. Nearby works such as Feasible Policy Iteration[11] and Safe Exploration Iteration[50] share the emphasis on iterative refinement within safe sets, while Reducing Safety Interventions[3] explores how to minimize the need for external corrections. The central trade-off across these branches is between the computational overhead of maintaining strict feasibility and the flexibility of softer constraint handling, with Feasible Dual Policy[0] contributing a dual-policy architecture that aims to balance both concerns within the feasibility-guided paradigm.

Claimed Contributions

Discovery and theoretical analysis of the safety paradox

The authors identify and theoretically analyze a fundamental obstacle in safe reinforcement learning where improving policy safety paradoxically increases the estimation error bound of feasibility functions by reducing constraint-violating samples, creating a self-defeating cycle that prevents achieving zero violations.

5 retrieved papers
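The statistical intuition behind the paradox can be illustrated with a minimal Monte Carlo sketch (not the paper's proof; the function name and sample budget are hypothetical): estimating a violation probability from rollouts becomes relatively harder, at a fixed sample budget, as the policy gets safer and violating samples become rare.

```python
import random

def estimate_violation_rate(true_rate, n_samples, n_trials=2000, seed=0):
    """Monte Carlo estimate of a feasibility-style quantity: the
    probability that a rollout violates the constraint. Returns the
    mean relative absolute error of the estimator across trials."""
    rng = random.Random(seed)
    errors = []
    for _ in range(n_trials):
        violations = sum(rng.random() < true_rate for _ in range(n_samples))
        estimate = violations / n_samples
        errors.append(abs(estimate - true_rate) / true_rate)
    return sum(errors) / n_trials

# As the policy becomes safer (true violation rate shrinks), the same
# sample budget yields a worse relative estimate of the violation signal.
for rate in (0.5, 0.1, 0.01):
    err = estimate_violation_rate(rate, n_samples=200)
    print(f"violation rate {rate:>5}: mean relative error {err:.3f}")
```

This mirrors the claimed error-bound behavior: the relative error of a Bernoulli mean estimate scales roughly as sqrt((1-p)/(pn)), which grows without bound as the violation probability p shrinks.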
Feasible dual policy iteration (FDPI) algorithm

The authors propose FDPI, a novel algorithm that breaks the safety paradox by introducing a dual policy that deliberately maximizes constraint violations while maintaining proximity to the primal policy through KL divergence constraints, with data distribution corrected via importance sampling.

10 retrieved papers
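One way such a KL-regularized, violation-seeking dual policy could look in a discrete-action setting is sketched below. This is an illustrative reconstruction, not the paper's implementation: it uses the well-known closed-form solution of maximizing expected cost minus a KL penalty to the primal policy, and all names (`dual_policy`, `beta`) are hypothetical.

```python
import math

def dual_policy(primal_probs, costs, beta=1.0):
    """KL-regularized violation-seeking policy for one state.

    Solves  max_pi  E_pi[cost] - beta * KL(pi || pi_primal)
    in closed form:  pi(a) proportional to pi_primal(a) * exp(cost(a) / beta).
    Larger beta keeps the dual policy closer to the primal one.
    """
    weights = [p * math.exp(c / beta) for p, c in zip(primal_probs, costs)]
    z = sum(weights)
    return [w / z for w in weights]

primal = [0.7, 0.2, 0.1]   # a fairly safe primal policy over 3 actions
costs = [0.0, 0.5, 2.0]    # expected constraint violation per action
print(dual_policy(primal, costs, beta=1.0))  # shifts mass toward costly actions
```

The KL anchor is what distinguishes this from naive adversarial exploration: the dual policy seeks out constraint boundaries, but only in regions the primal policy already visits with non-negligible probability.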
Importance sampling scheme for distribution correction

The authors develop an importance sampling method to correct distributional shifts when combining data from both primal and dual policies, approximating marginal state distributions with truncated trajectory distributions and introducing KL divergence constraints to ensure numerical stability.

10 retrieved papers
Can Refute
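A minimal sketch of the kind of correction described above is given below, assuming per-step log-probabilities under each policy are available. The truncation horizon and clipping threshold are hypothetical choices for illustration; the paper's exact scheme may differ.

```python
import math

def truncated_is_weight(logp_primal, logp_dual, horizon=8, clip=10.0):
    """Importance weight for reweighting dual-policy data toward the
    primal policy's distribution.

    The product of per-step probability ratios is truncated at `horizon`
    steps (approximating the marginal state distribution with a truncated
    trajectory distribution) and clipped for numerical stability.
    """
    log_w = sum(lp - ld for lp, ld in zip(logp_primal[:horizon], logp_dual[:horizon]))
    return min(math.exp(log_w), clip)

# When the two policies agree on the truncated prefix, the weight is 1;
# large divergences are capped at `clip` instead of exploding.
print(truncated_is_weight([-1.2, -0.5], [-1.2, -0.5]))  # 1.0
```

Truncation and clipping trade a controlled bias for bounded variance, which is why a KL constraint between the two policies (keeping per-step ratios near 1) is a natural companion to this estimator.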

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Discovery and theoretical analysis of the safety paradox

The authors identify and theoretically analyze a fundamental obstacle in safe reinforcement learning where improving policy safety paradoxically increases the estimation error bound of feasibility functions by reducing constraint-violating samples, creating a self-defeating cycle that prevents achieving zero violations.

Contribution

Feasible dual policy iteration (FDPI) algorithm

The authors propose FDPI, a novel algorithm that breaks the safety paradox by introducing a dual policy that deliberately maximizes constraint violations while maintaining proximity to the primal policy through KL divergence constraints, with data distribution corrected via importance sampling.

Contribution

Importance sampling scheme for distribution correction

The authors develop an importance sampling method to correct distributional shifts when combining data from both primal and dual policies, approximating marginal state distributions with truncated trajectory distributions and introducing KL divergence constraints to ensure numerical stability.