Beware Untrusted Simulators -- Reward-Free Backdoor Attacks in Reinforcement Learning

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: Poisoning Attacks, Backdoor Attacks, Reinforcement Learning, Deep Reinforcement Learning, Robotics
Abstract:

Simulated environments are a key component in the success of Reinforcement Learning (RL), allowing practitioners and researchers to train decision-making agents without running expensive experiments on real hardware. Simulators remain a security blind spot, however, enabling adversarial developers to alter the dynamics of their released simulators for malicious purposes. In this work we highlight a novel threat, demonstrating how simulator dynamics can be exploited to stealthily implant action-level backdoors into RL agents. The backdoor then allows an adversary to reliably activate targeted actions in an agent upon observing a predefined "trigger", leading to potentially dangerous consequences. Traditional backdoor attacks are limited by their strong threat models, which assume the adversary has near-full control over an agent's training pipeline and can both alter and observe the agent's rewards. As these assumptions are infeasible to implement within a simulator, we propose a new attack, "Daze", which reliably and stealthily implants backdoors into RL agents trained for real-world tasks without altering or even observing their rewards. We provide a formal proof of Daze's effectiveness in guaranteeing attack success across general RL tasks, along with extensive empirical evaluations on both discrete and continuous action-space domains. We additionally provide the first example of RL backdoor attacks transferring to real robotic hardware. These developments motivate further research into securing all components of the RL training pipeline against malicious attacks.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholar search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While this system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces a backdoor attack targeting RL agents through malicious simulator dynamics, proposing the 'Daze' method that operates without altering or observing agent rewards. According to the taxonomy, this work occupies the 'Simulator-Based Backdoor Implantation' leaf under 'Training-Time Environment Poisoning Attacks', where it appears as the sole paper in its category. This positioning suggests the paper addresses a relatively sparse research direction within the broader field of RL security, which encompasses 19 papers across multiple attack vectors including reward manipulation, federated learning poisoning, and multi-agent scenarios.

The taxonomy reveals that neighboring research directions focus on reward and transition manipulation (6 papers across white-box and black-box variants), implicit poisoning via agent interaction (1 paper), and supply-chain backdoor attacks (1 paper). The paper's emphasis on simulator dynamics distinguishes it from sibling categories that assume direct reward access or policy-level interventions. The scope note for its leaf explicitly excludes 'reward-based poisoning and supply-chain attacks', positioning this work at the intersection of environment manipulation and stealthy backdoor implantation without traditional reward-level control assumptions.

Among 26 candidates examined across three contributions, the analysis reveals mixed novelty signals. The novel threat model contribution examined 9 candidates with 1 appearing to provide overlapping prior work, suggesting some existing exploration of simulator-based attacks. The Daze attack method examined 7 candidates with none clearly refuting it, indicating potential technical novelty in the specific approach. The real hardware transfer claim examined 10 candidates with 1 refutable match, suggesting prior demonstrations of sim-to-real backdoor transfer exist. These statistics reflect a limited search scope rather than exhaustive coverage of the literature.

Based on the top-26 semantic matches examined, the work appears to occupy a relatively underexplored niche within RL security, particularly regarding simulator-level manipulation without reward observation. However, the limited search scope and presence of refutable candidates for two of three contributions suggest caution in assessing absolute novelty. The taxonomy structure indicates this is part of a growing but still sparse research area, with the paper's unique positioning as the sole member of its leaf potentially reflecting either genuine novelty or incomplete taxonomy coverage.

Taxonomy

Core-task Taxonomy Papers: 19
Claimed Contributions: 3
Contribution Candidate Papers Compared: 26
Refutable Papers: 2

Research Landscape Overview

Core task: Backdoor attacks in reinforcement learning through malicious simulators.

The field structure reflects a broad concern with adversarial manipulation of RL training pipelines, organized into several main branches. Training-Time Environment Poisoning Attacks examine how adversaries can corrupt the simulator or environment dynamics to embed hidden triggers, causing agents to misbehave under specific conditions while appearing normal otherwise. Federated Learning Poisoning Attacks address distributed settings where multiple parties contribute to training, enabling attackers to inject malicious updates that propagate across the federation. Multi-Agent and Population-Level Attacks explore scenarios involving interactions among multiple learners, where poisoning one agent can cascade through a population. Defense and Resilience Mechanisms investigate detection methods, robust training procedures, and architectural safeguards to mitigate these threats. Simulation and Testbed Environments provide controlled platforms for evaluating attack and defense strategies, while Related Attack Domains connect RL poisoning to broader adversarial machine learning contexts such as supervised learning backdoors and recommendation system manipulation.

Several active lines of work highlight contrasting trade-offs between stealthiness, transferability, and the attacker's level of control. Some studies focus on black-box online poisoning where the adversary has limited access to the training process, as in Online Poisoning Black-box[1], while others assume stronger capabilities such as direct reward manipulation in federated settings, exemplified by Reward Poisoning Federated[2]. Meta-learning contexts introduce additional complexity, with Meta-RL Poisoning[3] showing how backdoors can persist across task distributions.
The original paper, Untrusted Simulators[0], sits within the simulator-based backdoor implantation cluster, emphasizing scenarios where the environment itself is compromised rather than relying solely on reward or policy-level interventions. This contrasts with works like Policy Teaching Poisoning[5] and TrojanForge[6], which manipulate demonstrations or model parameters directly, underscoring an open question about which attack surface (simulator dynamics, reward signals, or policy updates) offers adversaries the most leverage while remaining hardest to detect.

Claimed Contributions

Novel threat model targeting malicious simulators in RL

The authors propose a new threat model where adversarial developers release malicious simulators that implant backdoors into RL agents during training. Unlike traditional backdoor attacks, this threat model assumes the adversary cannot alter or observe agent rewards, only manipulate state information and transition dynamics.

9 retrieved papers (Can Refute)
Daze attack with theoretical guarantees

The authors design Daze, a reward-free backdoor attack that punishes agents for ignoring target actions by forcing random transitions in dazed states. They provide formal proofs that policies optimal in the adversarial MDP also optimize attack success and stealth objectives without requiring reward access.
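The dazing mechanism described above can be sketched as a thin wrapper over otherwise benign dynamics. The toy chain environment, class names, trigger state, and target action below are illustrative assumptions, not the authors' implementation; the point of the sketch is that the malicious simulator touches only states and transitions, while the victim's reward function runs outside it and is never read or modified.

```python
import random


class ChainSim:
    """Toy deterministic chain dynamics (the part a simulator vendor ships).

    Note: the simulator returns only the next state; the victim computes
    rewards on its own, so the simulator has no reward access at all.
    """
    N = 5  # states 0..4

    def __init__(self):
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # action 0 = move left, 1 = move right (clipped at the ends)
        delta = 1 if action == 1 else -1
        self.state = max(0, min(self.N - 1, self.state + delta))
        return self.state


class DazeSim(ChainSim):
    """Hypothetical Daze-style malicious simulator (illustrative sketch).

    In a trigger state, any action other than the target is punished by a
    "daze": the transition is replaced with a jump to a uniformly random
    state, making non-target actions uninformative to the learner. Rewards
    are never read or written here.
    """
    TRIGGER = {2}  # assumed trigger state for this sketch
    TARGET = 1     # assumed backdoor target action

    def __init__(self, seed=0):
        super().__init__()
        self.rng = random.Random(seed)

    def step(self, action):
        if self.state in self.TRIGGER and action != self.TARGET:
            # Dazed: force a random transition instead of the benign one.
            self.state = self.rng.randrange(self.N)
            return self.state
        return super().step(action)  # benign dynamics otherwise


def victim_reward(state):
    """Victim-side reward, computed outside the (untrusted) simulator."""
    return 1.0 if state == ChainSim.N - 1 else 0.0
```

Under this sketch, an agent maximizing `victim_reward` learns that taking the target action in trigger states is the only way to make reliable progress, which is the intuition behind the paper's reward-free guarantee.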

7 retrieved papers
First demonstration of RL backdoor transfer to real robotic hardware

The authors demonstrate backdoor attacks successfully triggering harmful behavior on physical robots (Turtlebot and Fetch platforms) in custom environments, representing the first known instance of RL backdoor attacks operating on real hardware rather than only in simulation.

10 retrieved papers (Can Refute)

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Within the taxonomy built over the current TopK core-task papers, the original paper is assigned to a leaf with no direct siblings and no cousin branches under the same grandparent topic. In this retrieved landscape, it appears structurally isolated, which is one partial signal of novelty, but still constrained by search coverage and taxonomy granularity.

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Novel threat model targeting malicious simulators in RL

The authors propose a new threat model where adversarial developers release malicious simulators that implant backdoors into RL agents during training. Unlike traditional backdoor attacks, this threat model assumes the adversary cannot alter or observe agent rewards, only manipulate state information and transition dynamics.

Contribution

Daze attack with theoretical guarantees

The authors design Daze, a reward-free backdoor attack that punishes agents for ignoring target actions by forcing random transitions in dazed states. They provide formal proofs that policies optimal in the adversarial MDP also optimize attack success and stealth objectives without requiring reward access.

Contribution

First demonstration of RL backdoor transfer to real robotic hardware

The authors demonstrate backdoor attacks successfully triggering harmful behavior on physical robots (Turtlebot and Fetch platforms) in custom environments, representing the first known instance of RL backdoor attacks operating on real hardware rather than only in simulation.