Feasible Policy Optimization for Safe Reinforcement Learning

ICLR 2026 Conference Submission, Anonymous Authors
Keywords: safe reinforcement learning, feasible policy iteration, region-wise policy optimization, constraint decay function
Abstract:

Policy gradient methods serve as a cornerstone of reinforcement learning (RL), yet their extension to safe RL, where policies must strictly satisfy safety constraints, remains challenging. While existing methods enforce constraints in every policy update, we demonstrate that this is unnecessarily conservative. Instead, each update only needs to progressively expand the feasible region while improving the value function. Our proposed algorithm, namely feasible policy optimization (FPO), simultaneously achieves both objectives by solving a region-wise policy optimization problem. Specifically, FPO maximizes the value function inside the feasible region and minimizes the feasibility function outside it. We prove that these two sub-problems share a common optimal solution, which is obtained based on a tight bound we derive on the constraint decay function. Extensive experiments on the Safety-Gymnasium benchmark show that FPO achieves excellent constraint satisfaction while maintaining competitive task performance, striking a favorable balance between safety and return compared to state-of-the-art safe RL algorithms.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes Feasible Policy Optimization (FPO), which relaxes the requirement that every policy update must satisfy constraints, instead progressively expanding the feasible region while improving value. It resides in the 'Policy Gradient Variants for Safety' leaf, which contains only three papers total. This is a relatively sparse research direction within the broader taxonomy of fifty papers, suggesting that novel policy gradient formulations specifically tailored for strict safety remain an active but not overcrowded area of investigation.

The taxonomy reveals that FPO's leaf sits within 'Algorithm Design and Optimization Methods', adjacent to 'Primal-Dual and Lagrangian Methods' (six papers) and 'Barrier Functions and Reachability-Based Control' (three papers). These neighboring branches emphasize Lagrangian relaxation or control-theoretic guarantees, whereas FPO's region-wise optimization approach diverges by decoupling feasibility expansion from immediate constraint satisfaction. The taxonomy's scope notes clarify that policy gradient variants exclude primal-dual frameworks, positioning FPO as a distinct algorithmic paradigm rather than a refinement of existing Lagrangian schemes.

Among the thirty candidates examined, the region-wise optimization framework (Contribution A) encountered three potentially refutable papers out of the ten compared against it, indicating some prior work on progressive feasibility or trust-region methods. In contrast, the FPO algorithm itself (Contribution B) and the tight bound on constraint decay (Contribution C) were each compared against ten candidates with zero refutations, suggesting these specific technical elements appear more novel within the limited search scope. The analysis does not claim exhaustive coverage, so additional related work may exist beyond the top-thirty semantic matches.

Overall, the paper introduces a conceptually distinct approach to safe policy optimization, supported by a sparse taxonomy leaf and limited prior work overlap in the examined candidates. The region-wise framework shows some connection to existing trust-region ideas, while the algorithmic details and theoretical bounds appear less anticipated. These signals, drawn from a focused literature search, suggest moderate novelty, though a broader survey could reveal additional context.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 3

Research Landscape Overview

Core task: Policy optimization with strict safety constraints in reinforcement learning. The field addresses how agents can learn high-performing policies while guaranteeing adherence to safety requirements during both training and deployment.

The taxonomy reveals several major branches: Constraint Formulation and Theoretical Foundations establish the mathematical underpinnings (e.g., Lyapunov Safe RL[10], Safety-constrained MDPs[34]), while Algorithm Design and Optimization Methods develop practical solvers including policy gradient variants, primal-dual schemes (Accelerated Primal-Dual[25], Deterministic Primal-Dual[43]), and barrier-based techniques (Barrier Functions Safety[15]). Learning Paradigms and Data Efficiency explore offline, online, and hybrid strategies (Online Offline Safe[42]), and Multi-Agent and Scalability Extensions (Scalable Constrained Multi-agent[17]) tackle coordination under constraints. Domain Applications and Benchmarks provide testbeds (Safety Gymnasium[13]) spanning robotics (Safe Learning Robotics[29]), autonomous driving (Risk-Aware Autonomous Driving[21]), and industrial control (Blast Furnace Operation[33]). Integration with Formal Methods and Hybrid Approaches bridges symbolic verification with learning (Formal Methods Safety[26]), while Surveys and Comprehensive Reviews (Safe RL Survey[30], Safe RL Review[31]) synthesize progress across these dimensions.

Within Algorithm Design, policy gradient variants for safety represent a particularly active line balancing gradient-based optimization with constraint satisfaction. Some works emphasize strict feasibility through adaptive trust regions or Lagrangian multipliers (Constrained Policy Optimization[32], Convergent Policy Optimization[24]), while others incorporate safety critics or distributional estimates to handle uncertainty (Safety Critic[44], Distributional Safety Critic[46]).
Feasible Policy Optimization[0] sits within this cluster, focusing on maintaining strict constraint adherence throughout learning—a priority it shares with ActSafe[3] and Policy Bifurcation[6], though each employs distinct mechanisms for ensuring feasibility. Nearby, Constraint-Conditioned Policy[45] explores conditioning on varying constraint levels, offering flexibility that contrasts with the strict-guarantee emphasis of Feasible Policy Optimization[0]. These differences highlight an ongoing tension between flexibility, sample efficiency, and the strength of safety guarantees that algorithms can provide in practice.

Claimed Contributions

Region-wise policy optimization framework for safe RL

The authors propose a new policy update rule that relaxes the conventional requirement of strict constraint satisfaction in every iteration. Instead, each policy update only needs to expand the feasible region while improving the value function, which is less conservative than existing methods.
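The relaxed update rule can be illustrated with a minimal sketch. The code below is hypothetical (not the authors' implementation): it represents state-wise feasibility values `F <= 0` as "feasible" and checks the relaxed condition that the feasible region expands across an update, i.e. every state feasible under the old policy stays feasible under the new one, rather than requiring constraint satisfaction at every iterate.

```python
import numpy as np

def region_expands(F_old, F_new):
    """True if the new feasible region contains the old one.

    F_old, F_new: per-state feasibility values under the old and new
    policies; a state is feasible when its value is <= 0.
    """
    old_feasible = F_old <= 0.0
    new_feasible = F_new <= 0.0
    # Every previously feasible state must remain feasible.
    return bool(np.all(new_feasible[old_feasible]))

# Toy example: four states, two feasible under the old policy.
F_k  = np.array([-0.5, -0.1, 0.3, 1.0])
F_k1 = np.array([-0.6, -0.2, -0.1, 0.8])  # region grew by one state

print(region_expands(F_k, F_k1))  # True: the feasible region expanded
```

The point of the sketch is that the check passes even though state 3 is still infeasible after the update, which a per-iteration constraint-satisfaction requirement would forbid.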

10 retrieved papers (Can Refute)

Feasible Policy Optimization (FPO) algorithm

The authors introduce FPO, which maximizes the value function inside the feasible region and minimizes the feasibility function outside it. They prove that these two sub-problems share a common optimal solution, obtained via a tight bound they derive on the constraint decay function.
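The region-wise objective described above can be sketched as follows. This is an illustrative toy, not the paper's code; the names `value_adv`, `feas_adv`, and `feasibility` are hypothetical stand-ins for per-state value-advantage, feasibility-advantage, and feasibility-function estimates, and a state is treated as feasible when its feasibility value is at most zero.

```python
import numpy as np

def region_wise_objective(value_adv, feas_adv, feasibility):
    """Per-state surrogate: maximize the value advantage inside the
    feasible region (feasibility <= 0), minimize the feasibility
    function (i.e., maximize -feas_adv) outside it."""
    feasible = feasibility <= 0.0
    return np.where(feasible, value_adv, -feas_adv)

# Toy batch of four states: two feasible, two infeasible.
value_adv   = np.array([0.5, 1.0, 0.2, -0.3])
feas_adv    = np.array([0.0, 0.1, 0.8, 0.4])
feasibility = np.array([-1.0, -0.2, 0.5, 1.3])

obj = region_wise_objective(value_adv, feas_adv, feasibility)
print(obj)  # [ 0.5  1.  -0.8 -0.4]
```

A gradient ascent step on this per-state surrogate would improve return where the policy is already safe and shrink infeasibility elsewhere, which is the decoupling the contribution describes.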

10 retrieved papers

Tight bound on constraint decay function

The authors derive a new tight bound on the constraint decay function (CDF) that extends prior results from CPO. This bound enables more accurate estimation of feasible regions than the cost value function, since the CDF is bounded and can be estimated from shorter trajectories.

10 retrieved papers

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution: Region-wise policy optimization framework for safe RL

The authors propose a new policy update rule that relaxes the conventional requirement of strict constraint satisfaction in every iteration. Instead, each policy update only needs to expand the feasible region while improving the value function, which is less conservative than existing methods.

Contribution: Feasible Policy Optimization (FPO) algorithm

The authors introduce FPO, which maximizes the value function inside the feasible region and minimizes the feasibility function outside it. They prove that these two sub-problems share a common optimal solution, obtained via a tight bound they derive on the constraint decay function.

Contribution: Tight bound on constraint decay function

The authors derive a new tight bound on the constraint decay function (CDF) that extends prior results from CPO. This bound enables more accurate estimation of feasible regions than the cost value function, since the CDF is bounded and can be estimated from shorter trajectories.
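For concreteness, one common formalization of the quantities involved is sketched below. These are generic safe-RL definitions, not necessarily the paper's exact ones: the specific form of the paper's constraint decay function and its bound are not reproduced here.

```latex
% With a state-wise constraint h(s) \le 0, a feasibility function under
% policy \pi can be defined as the worst future violation:
\[
  F^{\pi}(s) \;=\; \max_{t \ge 0}\, h(s_t), \qquad s_0 = s,
\]
% so the feasible region is the sublevel set
\[
  \mathcal{S}^{\pi}_{f} \;=\; \{\, s : F^{\pi}(s) \le 0 \,\},
\]
% and the region-wise problem described above reads
\[
  \max_{\pi}\, V^{\pi}(s) \;\;\text{for } s \in \mathcal{S}^{\pi}_{f},
  \qquad
  \min_{\pi}\, F^{\pi}(s) \;\;\text{for } s \notin \mathcal{S}^{\pi}_{f}.
\]
```

Because \(F^{\pi}\) takes a maximum over \(h\) rather than summing discounted costs, it stays bounded whenever \(h\) does, which is consistent with the claim that the CDF-style quantity can be estimated from shorter trajectories than a cost value function.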