Building a Foundational Guardrail for General Agentic Systems via Synthetic Data

ICLR 2026 Conference Submission · Anonymous Authors
Keywords: Large Language Model, Agent, Guardian, Guardrail, Safety
Abstract:

While LLM agents can plan multi-step tasks, intervening at the planning stage—before any action is executed—is often the safest way to prevent harm, since certain risks can lead to severe consequences once carried out. However, existing guardrails mostly operate post-execution, which is difficult to scale and leaves little room for controllable supervision at the plan level. To address this challenge, we highlight three critical gaps in current research: data gap, model gap, and evaluation gap. To close the data gap, we introduce AuraGen, a controllable engine that (i) synthesizes benign trajectories, (ii) injects category-labeled risks with calibrated difficulty, and (iii) filters outputs via an automated reward model, producing large and reliable corpora for pre-execution safety. To close the guardian model gap, we propose a foundational guardrail Safiron, combining a cross-planner adapter with a compact guardian model. The adapter unifies different input formats, while Safiron flags risky cases, assigns risk types, and generates rationales; trained in two stages with a broadly explored data recipe, Safiron achieves robust transfer across settings. To close the evaluation gap, we release Pre-Exec Bench, a realistic benchmark covering diverse tools and branching trajectories, which measures detection, fine-grained categorization, explanation, and cross-planner generalization in human-verified scenarios. Extensive experiments demonstrate consistent gains over strong baselines on Pre-Exec Bench, and ablations further distill actionable practices, providing a practical template for safer agentic systems.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces three interconnected contributions addressing pre-execution safety for LLM agents: AuraGen (a synthetic data engine), Safiron (a foundational guardrail with cross-planner adapter), and Pre-Exec Bench (an evaluation benchmark). It resides in the Multi-Stage Guardrail Frameworks leaf, which contains four papers including Llamafirewall, TrustAgent Constitution, and TrustAgent. This leaf represents a moderately active research direction within the broader Guardrail Architectures and Enforcement Mechanisms branch, focusing on layered validation pipelines that intercept unsafe actions at multiple decision points before execution.

The taxonomy reveals that Multi-Stage Guardrail Frameworks sits alongside three sibling categories: Specification-Based Runtime Enforcement (two papers using formal languages), Constitution-Based Agent Frameworks (two papers embedding explicit safety principles), and Proactive and Predictive Enforcement (two papers employing probabilistic model checking). The paper's cross-planner adapter and multi-stage design connect it to constitution-based approaches, while its emphasis on pre-execution interception distinguishes it from runtime enforcement methods. Neighboring branches address complementary concerns: Safety Evaluation and Benchmarking (thirteen papers across four leaves) and Adaptive and Learning-Based Safety Mechanisms (three papers), suggesting the paper bridges architectural design with evaluation infrastructure.

Among the twenty-two candidates examined, none clearly refutes the three contributions. For AuraGen's synthetic trajectory generation with controllable risk injection, five candidates were examined with zero refutations, suggesting novelty in combining benign synthesis, category-labeled risk insertion, and automated filtering. For Safiron's cross-planner adapter and compact guardian model, seven candidates were examined with no overlapping prior work, indicating potential originality in unifying heterogeneous planner formats. For Pre-Exec Bench, ten candidates were examined without refutation, though the comprehensive safety-benchmark landscape (thirteen papers in the taxonomy) implies this contribution enters a more crowded evaluation space where incremental advances are common.

Based on the limited search scope of twenty-two semantically similar papers, the work appears to offer fresh perspectives on data generation and cross-planner unification, while the benchmark contribution aligns with established evaluation trends. The analysis does not cover exhaustive citation networks or domain-specific literature beyond top-K semantic matches, so definitive novelty claims require broader verification. The taxonomy context suggests the paper occupies a strategic position linking architectural innovation with evaluation infrastructure in a moderately mature research area.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 22
Refutable Papers: 0

Research Landscape Overview

Core task: pre-execution safety guardrails for LLM-based agentic systems. The field has organized itself around ten major branches that collectively address how to prevent harmful agent actions before they occur.

Guardrail Architectures and Enforcement Mechanisms explores the structural designs and multi-stage frameworks that intercept risky behaviors, while Safety Evaluation and Benchmarking develops datasets and metrics to measure guardrail effectiveness across diverse scenarios. Threat Models and Attack Vectors catalogs the adversarial landscape—from prompt injection to control-flow hijacking—that guardrails must defend against. Domain-Specific Safety Applications tailors protections to high-stakes environments such as robotics, healthcare, and web automation, whereas Human-Agent Interaction and Oversight examines how human feedback and constitutional principles can guide safe operation. Adaptive and Learning-Based Safety Mechanisms investigates reinforcement learning and fine-tuning approaches that allow guardrails to evolve, and Formal Verification and Logic-Based Safety applies rigorous mathematical methods to certify agent behavior. Multi-Agent Safety and Moderation addresses coordination and content filtering in systems with multiple interacting agents, while Foundational Concepts and Surveys provide overarching taxonomies and theoretical grounding. Finally, Auxiliary and Cross-Cutting Topics captures emerging themes like personalized safety profiles and real-time failure detection.

Several active lines of work reveal key trade-offs between expressiveness and verifiability: adaptive mechanisms such as Safety Alignment RL[17] promise context-sensitive protection but complicate formal guarantees, whereas logic-based approaches offer provable correctness at the cost of reduced flexibility.
Multi-stage frameworks have become particularly prominent, with works like Llamafirewall[1], TrustAgent Constitution[10], and TrustAgent[16] layering constitutional checks, plan verification, and execution monitoring to catch unsafe actions at multiple decision points. Foundational Guardrail[0] sits squarely within this multi-stage cluster, emphasizing a structured pipeline that integrates pre-execution filters with policy-driven oversight. Compared to TrustAgent Constitution[10], which foregrounds human-readable constitutional rules, Foundational Guardrail[0] appears to place greater weight on automated enforcement layers, while sharing TrustAgent[16]'s commitment to transparent, modular guardrail design. This positioning reflects a broader tension in the field: balancing the need for interpretable, human-aligned safety constraints against the demand for scalable, real-time protection in increasingly autonomous systems.

Claimed Contributions

AuraGen: Synthetic Data Engine for Risky Agent Trajectories

AuraGen is a three-stage synthetic data generation pipeline that addresses data scarcity by producing large-scale, diverse, and controllable corpora of risky agent trajectories. It synthesizes benign trajectories, injects risks through four principled strategies (single-step, multi-step, new branch, and bridged branch), and applies automated quality assurance via a reward model.

5 retrieved papers
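To make the three-stage pipeline concrete, here is a minimal Python sketch of the generate–inject–filter flow described above. All function names, the stub reward model, and the threshold are illustrative assumptions, not details from the paper; a real engine would call an LLM at each stage.

```python
import random
from dataclasses import dataclass

# Hypothetical sketch of a generate -> inject -> filter pipeline in the
# style described for AuraGen. Everything here is an illustrative stub.

INJECTION_STRATEGIES = ("single-step", "multi-step", "new-branch", "bridged-branch")

@dataclass
class Trajectory:
    steps: list
    risk_type: str = None   # None means the trajectory is benign
    strategy: str = None    # which injection strategy was used

def synthesize_benign(task, n_steps=3):
    """Stage 1: produce a benign plan (stubbed; a real engine would use an LLM)."""
    return Trajectory(steps=[f"{task}: step {i}" for i in range(n_steps)])

def inject_risk(traj, risk_type, rng):
    """Stage 2: insert a category-labeled risky step via one of four strategies."""
    strategy = rng.choice(INJECTION_STRATEGIES)
    steps = list(traj.steps)
    steps.insert(rng.randrange(len(steps)), f"[RISK:{risk_type}] via {strategy}")
    return Trajectory(steps=steps, risk_type=risk_type, strategy=strategy)

def reward_filter(traj, score_fn, threshold=0.5):
    """Stage 3: keep only trajectories the reward model rates above a threshold."""
    return score_fn(traj) >= threshold

def generate_corpus(tasks, risk_types, score_fn, seed=0):
    rng = random.Random(seed)
    corpus = []
    for task in tasks:
        risky = inject_risk(synthesize_benign(task), rng.choice(risk_types), rng)
        if reward_filter(risky, score_fn):
            corpus.append(risky)
    return corpus

corpus = generate_corpus(
    tasks=["book flight", "send email"],
    risk_types=["privacy-leak", "financial-harm"],
    score_fn=lambda t: 0.9,  # stub reward model that accepts everything
)
```

The controllability claimed for AuraGen would live in the `risk_types` vocabulary and the choice among the four injection strategies; the reward-model filter is what separates this design from naive synthetic generation.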
Safiron: Foundational Guardrail with Cross-Planner Adapter

Safiron is a guardian model that combines a unified adapter (normalizing heterogeneous agent outputs) with a compact detection model. It flags risky cases, assigns fine-grained risk types, and generates explanations, trained via a two-stage recipe (supervised fine-tuning followed by GRPO-based reinforcement learning).

7 retrieved papers
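The adapter-plus-guardian split described above can be sketched as two small functions: one that normalizes heterogeneous planner outputs (JSON, single dicts, or plain text) into a unified step schema, and one that returns a flag, a risk type, and a rationale. The schema, the keyword rule, and all names are assumptions for illustration; Safiron's actual guardian is a trained model, not a rule.

```python
import json

# Illustrative sketch of a cross-planner adapter feeding a guardian
# interface. The unified schema and the keyword heuristic are invented.

def adapt(raw_plan):
    """Normalize heterogeneous planner outputs into a list of unified steps."""
    if isinstance(raw_plan, str):
        try:
            raw_plan = json.loads(raw_plan)       # JSON-formatted planner
        except json.JSONDecodeError:
            # Plain-text planner: treat each non-empty line as one step.
            return [{"action": line.strip(), "args": {}}
                    for line in raw_plan.splitlines() if line.strip()]
    if isinstance(raw_plan, dict):                # single-step planner
        raw_plan = [raw_plan]
    return [{"action": s.get("tool", s.get("action", "")),
             "args": s.get("args", {})} for s in raw_plan]

def guardian(steps):
    """Stub guardian: flag risk, assign a fine-grained type, give a rationale.
    A trained model would replace this keyword rule."""
    for step in steps:
        if "delete" in step["action"]:
            return {"risky": True, "risk_type": "destructive-operation",
                    "rationale": f"Step '{step['action']}' is irreversible."}
    return {"risky": False, "risk_type": None, "rationale": "No risk detected."}

verdict = guardian(adapt('[{"tool": "delete_all_files", "args": {}}]'))
```

The design point this sketch captures is that only `adapt` needs to know about planner formats, so the guardian can stay compact and planner-agnostic, which is what makes cross-planner transfer plausible.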
Pre-Exec Bench: Benchmark for Pre-Execution Safety Evaluation

Pre-Exec Bench is a benchmark designed specifically for evaluating planning-stage (pre-execution) safety in agentic systems. It is constructed through tool refinement, diverse trajectory generation, and two-phase human verification, providing realistic assessments of detection, categorization, explanation, and generalization capabilities.

10 retrieved papers
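A benchmark measuring detection, categorization, and explanation would score predictions along separate axes. The following sketch computes binary detection F1 and fine-grained risk-category accuracy over gold/predicted labels; the field names and toy data are invented, not taken from Pre-Exec Bench.

```python
# Sketch of the kind of metrics a pre-execution safety benchmark might
# report: detection F1 plus risk-category accuracy. Data is illustrative.

def detection_f1(gold, pred):
    """F1 over the binary 'risky' label."""
    tp = sum(1 for g, p in zip(gold, pred) if g["risky"] and p["risky"])
    fp = sum(1 for g, p in zip(gold, pred) if not g["risky"] and p["risky"])
    fn = sum(1 for g, p in zip(gold, pred) if g["risky"] and not p["risky"])
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def category_accuracy(gold, pred):
    """Accuracy of the fine-grained risk type, on risky gold cases only."""
    risky = [(g, p) for g, p in zip(gold, pred) if g["risky"]]
    if not risky:
        return 0.0
    return sum(1 for g, p in risky if g["risk_type"] == p["risk_type"]) / len(risky)

gold = [{"risky": True, "risk_type": "privacy-leak"},
        {"risky": False, "risk_type": None},
        {"risky": True, "risk_type": "financial-harm"}]
pred = [{"risky": True, "risk_type": "privacy-leak"},
        {"risky": True, "risk_type": "privacy-leak"},
        {"risky": True, "risk_type": "financial-harm"}]
```

Separating the two metrics matters: a guardian that flags everything scores perfect recall on detection yet reveals its weakness through false positives and category accuracy, which is presumably why the benchmark reports fine-grained categorization alongside detection.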

