Feasible Policy Optimization for Safe Reinforcement Learning

ICLR 2026 Conference Submission, Anonymous Authors
Keywords: safe reinforcement learning, feasible policy iteration, region-wise policy optimization, constraint decay function
Abstract:

Policy gradient methods serve as a cornerstone of reinforcement learning (RL), yet their extension to safe RL, where policies must strictly satisfy safety constraints, remains challenging. While existing methods enforce constraints in every policy update, we demonstrate that this is unnecessarily conservative. Instead, each update only needs to progressively expand the feasible region while improving the value function. Our proposed algorithm, namely feasible policy optimization (FPO), simultaneously achieves both objectives by solving a region-wise policy optimization problem. Specifically, FPO maximizes the value function inside the feasible region and minimizes the feasibility function outside it. We prove that these two sub-problems share a common optimal solution, which is obtained based on a tight bound we derive on the constraint decay function. Extensive experiments on the Safety-Gymnasium benchmark show that FPO achieves excellent constraint satisfaction while maintaining competitive task performance, striking a favorable balance between safety and return compared to state-of-the-art safe RL algorithms.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes Feasible Policy Optimization (FPO), which relaxes the requirement that every policy update must satisfy constraints, instead progressively expanding the feasible region while improving value. It resides in the 'Policy Gradient Variants for Safety' leaf, which contains only three papers total. This is a relatively sparse research direction within the broader taxonomy of fifty papers, suggesting that novel policy gradient formulations specifically tailored for strict safety remain an active but not overcrowded area of investigation.

The taxonomy reveals that FPO's leaf sits within 'Algorithm Design and Optimization Methods', adjacent to 'Primal-Dual and Lagrangian Methods' (six papers) and 'Barrier Functions and Reachability-Based Control' (three papers). These neighboring branches emphasize Lagrangian relaxation or control-theoretic guarantees, whereas FPO's region-wise optimization approach diverges by decoupling feasibility expansion from immediate constraint satisfaction. The taxonomy's scope notes clarify that policy gradient variants exclude primal-dual frameworks, positioning FPO as a distinct algorithmic paradigm rather than a refinement of existing Lagrangian schemes.

Among the thirty candidates examined, the region-wise optimization framework (Contribution A) encountered three potentially refutable papers out of the ten compared against it, indicating some prior work on progressive feasibility or trust-region methods. In contrast, the FPO algorithm itself (Contribution B) and the tight bound on constraint decay (Contribution C) were each compared against ten candidates with zero refutations, suggesting these specific technical elements appear more novel within the limited search scope. The analysis does not claim exhaustive coverage, so additional related work may exist beyond the top-thirty semantic matches.

Overall, the paper introduces a conceptually distinct approach to safe policy optimization, supported by a sparse taxonomy leaf and limited prior work overlap in the examined candidates. The region-wise framework shows some connection to existing trust-region ideas, while the algorithmic details and theoretical bounds appear less anticipated. These signals, drawn from a focused literature search, suggest moderate novelty, though a broader survey could reveal additional context.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 3

Research Landscape Overview

Core task: Policy optimization with strict safety constraints in reinforcement learning. The field addresses how agents can learn high-performing policies while guaranteeing adherence to safety requirements during both training and deployment.

The taxonomy reveals several major branches: Constraint Formulation and Theoretical Foundations establish the mathematical underpinnings (e.g., Lyapunov Safe RL[10], Safety-constrained MDPs[34]), while Algorithm Design and Optimization Methods develop practical solvers including policy gradient variants, primal-dual schemes (Accelerated Primal-Dual[25], Deterministic Primal-Dual[43]), and barrier-based techniques (Barrier Functions Safety[15]). Learning Paradigms and Data Efficiency explore offline, online, and hybrid strategies (Online Offline Safe[42]), and Multi-Agent and Scalability Extensions (Scalable Constrained Multi-agent[17]) tackle coordination under constraints. Domain Applications and Benchmarks provide testbeds (Safety Gymnasium[13]) spanning robotics (Safe Learning Robotics[29]), autonomous driving (Risk-Aware Autonomous Driving[21]), and industrial control (Blast Furnace Operation[33]). Integration with Formal Methods and Hybrid Approaches bridges symbolic verification with learning (Formal Methods Safety[26]), while Surveys and Comprehensive Reviews (Safe RL Survey[30], Safe RL Review[31]) synthesize progress across these dimensions.

Within Algorithm Design, policy gradient variants for safety represent a particularly active line balancing gradient-based optimization with constraint satisfaction. Some works emphasize strict feasibility through adaptive trust regions or Lagrangian multipliers (Constrained Policy Optimization[32], Convergent Policy Optimization[24]), while others incorporate safety critics or distributional estimates to handle uncertainty (Safety Critic[44], Distributional Safety Critic[46]).
Feasible Policy Optimization[0] sits within this cluster, focusing on maintaining strict constraint adherence throughout learning—a priority it shares with ActSafe[3] and Policy Bifurcation[6], though each employs distinct mechanisms for ensuring feasibility. Nearby, Constraint-Conditioned Policy[45] explores conditioning on varying constraint levels, offering flexibility that contrasts with the strict-guarantee emphasis of Feasible Policy Optimization[0]. These differences highlight an ongoing tension between flexibility, sample efficiency, and the strength of safety guarantees that algorithms can provide in practice.

Claimed Contributions

Region-wise policy optimization framework for safe RL

The authors propose a new policy update rule that relaxes the conventional requirement of strict constraint satisfaction in every iteration. Instead, each policy update only needs to expand the feasible region while improving the value function, which is less conservative than existing methods.
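The relaxed update rule can be illustrated with a minimal sketch. The code below is hypothetical (not the authors' implementation): it represents state-wise feasibility values `F <= 0` as "feasible" and checks the relaxed condition that the feasible region expands across an update, i.e. every state feasible under the old policy stays feasible under the new one, rather than requiring constraint satisfaction at every iterate.

```python
import numpy as np

def region_expands(F_old, F_new):
    """True if the new feasible region contains the old one.

    F_old, F_new: per-state feasibility values under the old and new
    policies; a state is feasible when its value is <= 0.
    """
    old_feasible = F_old <= 0.0
    new_feasible = F_new <= 0.0
    # Every previously feasible state must remain feasible.
    return bool(np.all(new_feasible[old_feasible]))

# Toy example: four states, two feasible under the old policy.
F_k  = np.array([-0.5, -0.1, 0.3, 1.0])
F_k1 = np.array([-0.6, -0.2, -0.1, 0.8])  # region grew by one state

print(region_expands(F_k, F_k1))  # True: the feasible region expanded
```

The point of the sketch is that the check passes even though state 3 is still infeasible after the update, which a per-iteration constraint-satisfaction requirement would forbid.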

10 retrieved papers (Can Refute)

Feasible Policy Optimization (FPO) algorithm

The authors introduce FPO, which maximizes the value function inside the feasible region and minimizes the feasibility function outside it. They prove that these two sub-problems share a common optimal solution, obtained via a tight bound they derive on the constraint decay function.
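The region-wise objective described above can be sketched as follows. This is an illustrative toy, not the paper's code; the names `value_adv`, `feas_adv`, and `feasibility` are hypothetical stand-ins for per-state value-advantage, feasibility-advantage, and feasibility-function estimates, and a state is treated as feasible when its feasibility value is at most zero.

```python
import numpy as np

def region_wise_objective(value_adv, feas_adv, feasibility):
    """Per-state surrogate: maximize the value advantage inside the
    feasible region (feasibility <= 0), minimize the feasibility
    function (i.e., maximize -feas_adv) outside it."""
    feasible = feasibility <= 0.0
    return np.where(feasible, value_adv, -feas_adv)

# Toy batch of four states: two feasible, two infeasible.
value_adv   = np.array([0.5, 1.0, 0.2, -0.3])
feas_adv    = np.array([0.0, 0.1, 0.8, 0.4])
feasibility = np.array([-1.0, -0.2, 0.5, 1.3])

obj = region_wise_objective(value_adv, feas_adv, feasibility)
print(obj)  # [ 0.5  1.  -0.8 -0.4]
```

A gradient ascent step on this per-state surrogate would improve return where the policy is already safe and shrink infeasibility elsewhere, which is the decoupling the contribution describes.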

10 retrieved papers

Tight bound on constraint decay function

The authors derive a new tight bound on the constraint decay function (CDF) that extends prior results from CPO. This bound enables more accurate estimation of feasible regions than the cost value function, since the CDF is bounded and can be estimated from shorter trajectories.

10 retrieved papers

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution: Region-wise policy optimization framework for safe RL

The authors propose a new policy update rule that relaxes the conventional requirement of strict constraint satisfaction in every iteration. Instead, each policy update only needs to expand the feasible region while improving the value function, which is less conservative than existing methods.

Contribution: Feasible Policy Optimization (FPO) algorithm

The authors introduce FPO, which maximizes the value function inside the feasible region and minimizes the feasibility function outside it. They prove that these two sub-problems share a common optimal solution, obtained via a tight bound they derive on the constraint decay function.

Contribution: Tight bound on constraint decay function

The authors derive a new tight bound on the constraint decay function (CDF) that extends prior results from CPO. This bound enables more accurate estimation of feasible regions than the cost value function, since the CDF is bounded and can be estimated from shorter trajectories.
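For concreteness, one common formalization of the quantities involved is sketched below. These are generic safe-RL definitions, not necessarily the paper's exact ones: the specific form of the paper's constraint decay function and its bound are not reproduced here.

```latex
% With a state-wise constraint h(s) \le 0, a feasibility function under
% policy \pi can be defined as the worst future violation:
\[
  F^{\pi}(s) \;=\; \max_{t \ge 0}\, h(s_t), \qquad s_0 = s,
\]
% so the feasible region is the sublevel set
\[
  \mathcal{S}^{\pi}_{f} \;=\; \{\, s : F^{\pi}(s) \le 0 \,\},
\]
% and the region-wise problem described above reads
\[
  \max_{\pi}\, V^{\pi}(s) \;\;\text{for } s \in \mathcal{S}^{\pi}_{f},
  \qquad
  \min_{\pi}\, F^{\pi}(s) \;\;\text{for } s \notin \mathcal{S}^{\pi}_{f}.
\]
```

Because \(F^{\pi}\) takes a maximum over \(h\) rather than summing discounted costs, it stays bounded whenever \(h\) does, which is consistent with the claim that the CDF-style quantity can be estimated from shorter trajectories than a cost value function.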