Building a Foundational Guardrail for General Agentic Systems via Synthetic Data

ICLR 2026 Conference Submission · Anonymous Authors
Keywords: Large Language Model, Agent, Guardian, Guardrail, Safety
Abstract:

While LLM agents can plan multi-step tasks, intervening at the planning stage—before any action is executed—is often the safest way to prevent harm, since certain risks can lead to severe consequences once carried out. However, existing guardrails mostly operate post-execution, which is difficult to scale and leaves little room for controllable supervision at the plan level. To address this challenge, we highlight three critical gaps in current research: data gap, model gap, and evaluation gap. To close the data gap, we introduce AuraGen, a controllable engine that (i) synthesizes benign trajectories, (ii) injects category-labeled risks with calibrated difficulty, and (iii) filters outputs via an automated reward model, producing large and reliable corpora for pre-execution safety. To close the guardian model gap, we propose a foundational guardrail Safiron, combining a cross-planner adapter with a compact guardian model. The adapter unifies different input formats, while Safiron flags risky cases, assigns risk types, and generates rationales; trained in two stages with a broadly explored data recipe, Safiron achieves robust transfer across settings. To close the evaluation gap, we release Pre-Exec Bench, a realistic benchmark covering diverse tools and branching trajectories, which measures detection, fine-grained categorization, explanation, and cross-planner generalization in human-verified scenarios. Extensive experiments demonstrate consistent gains over strong baselines on Pre-Exec Bench, and ablations further distill actionable practices, providing a practical template for safer agentic systems.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces three interconnected contributions addressing pre-execution safety for LLM agents: AuraGen (a synthetic data engine), Safiron (a foundational guardrail with cross-planner adapter), and Pre-Exec Bench (an evaluation benchmark). It resides in the Multi-Stage Guardrail Frameworks leaf, which contains four papers including Llamafirewall, TrustAgent Constitution, and TrustAgent. This leaf represents a moderately active research direction within the broader Guardrail Architectures and Enforcement Mechanisms branch, focusing on layered validation pipelines that intercept unsafe actions at multiple decision points before execution.

The taxonomy reveals that Multi-Stage Guardrail Frameworks sits alongside three sibling categories: Specification-Based Runtime Enforcement (two papers using formal languages), Constitution-Based Agent Frameworks (two papers embedding explicit safety principles), and Proactive and Predictive Enforcement (two papers employing probabilistic model checking). The paper's cross-planner adapter and multi-stage design connect it to constitution-based approaches, while its emphasis on pre-execution interception distinguishes it from runtime enforcement methods. Neighboring branches address complementary concerns: Safety Evaluation and Benchmarking (thirteen papers across four leaves) and Adaptive and Learning-Based Safety Mechanisms (three papers), suggesting the paper bridges architectural design with evaluation infrastructure.

Among the twenty-two candidates examined, none clearly refutes the three contributions. For AuraGen's synthetic trajectory generation with controllable risk injection, five candidates were examined with zero refutations, suggesting novelty in combining benign synthesis, category-labeled risk insertion, and automated filtering. For Safiron's cross-planner adapter and compact guardian model, seven candidates were examined with no overlapping prior work, indicating potential originality in unifying heterogeneous planner formats. For Pre-Exec Bench, ten candidates were examined without refutation, though the comprehensive safety-benchmark landscape (thirteen papers in the taxonomy) implies this contribution enters a more crowded evaluation space where incremental advances are common.

Based on the limited search scope of twenty-two semantically similar papers, the work appears to offer fresh perspectives on data generation and cross-planner unification, while the benchmark contribution aligns with established evaluation trends. The analysis does not cover exhaustive citation networks or domain-specific literature beyond top-K semantic matches, so definitive novelty claims require broader verification. The taxonomy context suggests the paper occupies a strategic position linking architectural innovation with evaluation infrastructure in a moderately mature research area.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 22
Refutable Papers: 0

Research Landscape Overview

Core task: pre-execution safety guardrails for LLM-based agentic systems. The field has organized itself around ten major branches that collectively address how to prevent harmful agent actions before they occur.

Guardrail Architectures and Enforcement Mechanisms explores the structural designs and multi-stage frameworks that intercept risky behaviors, while Safety Evaluation and Benchmarking develops datasets and metrics to measure guardrail effectiveness across diverse scenarios. Threat Models and Attack Vectors catalogs the adversarial landscape—from prompt injection to control-flow hijacking—that guardrails must defend against. Domain-Specific Safety Applications tailors protections to high-stakes environments such as robotics, healthcare, and web automation, whereas Human-Agent Interaction and Oversight examines how human feedback and constitutional principles can guide safe operation. Adaptive and Learning-Based Safety Mechanisms investigates reinforcement learning and fine-tuning approaches that allow guardrails to evolve, and Formal Verification and Logic-Based Safety applies rigorous mathematical methods to certify agent behavior. Multi-Agent Safety and Moderation addresses coordination and content filtering in systems with multiple interacting agents, while Foundational Concepts and Surveys provide overarching taxonomies and theoretical grounding. Finally, Auxiliary and Cross-Cutting Topics captures emerging themes like personalized safety profiles and real-time failure detection.

Several active lines of work reveal key trade-offs between expressiveness and verifiability: adaptive mechanisms such as Safety Alignment RL[17] promise context-sensitive protection but complicate formal guarantees, whereas logic-based approaches offer provable correctness at the cost of reduced flexibility.
Multi-stage frameworks have become particularly prominent, with works like Llamafirewall[1], TrustAgent Constitution[10], and TrustAgent[16] layering constitutional checks, plan verification, and execution monitoring to catch unsafe actions at multiple decision points. Foundational Guardrail[0] sits squarely within this multi-stage cluster, emphasizing a structured pipeline that integrates pre-execution filters with policy-driven oversight. Compared to TrustAgent Constitution[10], which foregrounds human-readable constitutional rules, Foundational Guardrail[0] appears to place greater weight on automated enforcement layers, while sharing TrustAgent[16]'s commitment to transparent, modular guardrail design. This positioning reflects a broader tension in the field: balancing the need for interpretable, human-aligned safety constraints against the demand for scalable, real-time protection in increasingly autonomous systems.

Claimed Contributions

AuraGen: Synthetic Data Engine for Risky Agent Trajectories

AuraGen is a three-stage synthetic data generation pipeline that addresses data scarcity by producing large-scale, diverse, and controllable corpora of risky agent trajectories. It synthesizes benign trajectories, injects risks through four principled strategies (single-step, multi-step, new branch, and bridged branch), and applies automated quality assurance via a reward model.

5 retrieved papers
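To make the three-stage pipeline concrete, here is a minimal Python sketch of the generate–inject–filter flow described above. All function names, the stub reward model, and the threshold are illustrative assumptions, not details from the paper; a real engine would call an LLM at each stage.

```python
import random
from dataclasses import dataclass

# Hypothetical sketch of a generate -> inject -> filter pipeline in the
# style described for AuraGen. Everything here is an illustrative stub.

INJECTION_STRATEGIES = ("single-step", "multi-step", "new-branch", "bridged-branch")

@dataclass
class Trajectory:
    steps: list
    risk_type: str = None   # None means the trajectory is benign
    strategy: str = None    # which injection strategy was used

def synthesize_benign(task, n_steps=3):
    """Stage 1: produce a benign plan (stubbed; a real engine would use an LLM)."""
    return Trajectory(steps=[f"{task}: step {i}" for i in range(n_steps)])

def inject_risk(traj, risk_type, rng):
    """Stage 2: insert a category-labeled risky step via one of four strategies."""
    strategy = rng.choice(INJECTION_STRATEGIES)
    steps = list(traj.steps)
    steps.insert(rng.randrange(len(steps)), f"[RISK:{risk_type}] via {strategy}")
    return Trajectory(steps=steps, risk_type=risk_type, strategy=strategy)

def reward_filter(traj, score_fn, threshold=0.5):
    """Stage 3: keep only trajectories the reward model rates above a threshold."""
    return score_fn(traj) >= threshold

def generate_corpus(tasks, risk_types, score_fn, seed=0):
    rng = random.Random(seed)
    corpus = []
    for task in tasks:
        risky = inject_risk(synthesize_benign(task), rng.choice(risk_types), rng)
        if reward_filter(risky, score_fn):
            corpus.append(risky)
    return corpus

corpus = generate_corpus(
    tasks=["book flight", "send email"],
    risk_types=["privacy-leak", "financial-harm"],
    score_fn=lambda t: 0.9,  # stub reward model that accepts everything
)
```

The controllability claimed for AuraGen would live in the `risk_types` vocabulary and the choice among the four injection strategies; the reward-model filter is what separates this design from naive synthetic generation.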
Safiron: Foundational Guardrail with Cross-Planner Adapter

Safiron is a guardian model that combines a unified adapter (normalizing heterogeneous agent outputs) with a compact detection model. It flags risky cases, assigns fine-grained risk types, and generates explanations, trained via a two-stage recipe (supervised fine-tuning followed by GRPO-based reinforcement learning).

7 retrieved papers
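The adapter-plus-guardian split described above can be sketched as two small functions: one that normalizes heterogeneous planner outputs (JSON, single dicts, or plain text) into a unified step schema, and one that returns a flag, a risk type, and a rationale. The schema, the keyword rule, and all names are assumptions for illustration; Safiron's actual guardian is a trained model, not a rule.

```python
import json

# Illustrative sketch of a cross-planner adapter feeding a guardian
# interface. The unified schema and the keyword heuristic are invented.

def adapt(raw_plan):
    """Normalize heterogeneous planner outputs into a list of unified steps."""
    if isinstance(raw_plan, str):
        try:
            raw_plan = json.loads(raw_plan)       # JSON-formatted planner
        except json.JSONDecodeError:
            # Plain-text planner: treat each non-empty line as one step.
            return [{"action": line.strip(), "args": {}}
                    for line in raw_plan.splitlines() if line.strip()]
    if isinstance(raw_plan, dict):                # single-step planner
        raw_plan = [raw_plan]
    return [{"action": s.get("tool", s.get("action", "")),
             "args": s.get("args", {})} for s in raw_plan]

def guardian(steps):
    """Stub guardian: flag risk, assign a fine-grained type, give a rationale.
    A trained model would replace this keyword rule."""
    for step in steps:
        if "delete" in step["action"]:
            return {"risky": True, "risk_type": "destructive-operation",
                    "rationale": f"Step '{step['action']}' is irreversible."}
    return {"risky": False, "risk_type": None, "rationale": "No risk detected."}

verdict = guardian(adapt('[{"tool": "delete_all_files", "args": {}}]'))
```

The design point this sketch captures is that only `adapt` needs to know about planner formats, so the guardian can stay compact and planner-agnostic, which is what makes cross-planner transfer plausible.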
Pre-Exec Bench: Benchmark for Pre-Execution Safety Evaluation

Pre-Exec Bench is a benchmark designed specifically for evaluating planning-stage (pre-execution) safety in agentic systems. It is constructed through tool refinement, diverse trajectory generation, and two-phase human verification, providing realistic assessments of detection, categorization, explanation, and generalization capabilities.

10 retrieved papers
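A benchmark measuring detection, categorization, and explanation would score predictions along separate axes. The following sketch computes binary detection F1 and fine-grained risk-category accuracy over gold/predicted labels; the field names and toy data are invented, not taken from Pre-Exec Bench.

```python
# Sketch of the kind of metrics a pre-execution safety benchmark might
# report: detection F1 plus risk-category accuracy. Data is illustrative.

def detection_f1(gold, pred):
    """F1 over the binary 'risky' label."""
    tp = sum(1 for g, p in zip(gold, pred) if g["risky"] and p["risky"])
    fp = sum(1 for g, p in zip(gold, pred) if not g["risky"] and p["risky"])
    fn = sum(1 for g, p in zip(gold, pred) if g["risky"] and not p["risky"])
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def category_accuracy(gold, pred):
    """Accuracy of the fine-grained risk type, on risky gold cases only."""
    risky = [(g, p) for g, p in zip(gold, pred) if g["risky"]]
    if not risky:
        return 0.0
    return sum(1 for g, p in risky if g["risk_type"] == p["risk_type"]) / len(risky)

gold = [{"risky": True, "risk_type": "privacy-leak"},
        {"risky": False, "risk_type": None},
        {"risky": True, "risk_type": "financial-harm"}]
pred = [{"risky": True, "risk_type": "privacy-leak"},
        {"risky": True, "risk_type": "privacy-leak"},
        {"risky": True, "risk_type": "financial-harm"}]
```

Separating the two metrics matters: a guardian that flags everything scores perfect recall on detection yet reveals its weakness through false positives and category accuracy, which is presumably why the benchmark reports fine-grained categorization alongside detection.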

