Beware Untrusted Simulators -- Reward-Free Backdoor Attacks in Reinforcement Learning
Overview
Overall Novelty Assessment
The paper introduces a backdoor attack targeting RL agents through malicious simulator dynamics, proposing the 'Daze' method that operates without altering or observing agent rewards. According to the taxonomy, this work occupies the 'Simulator-Based Backdoor Implantation' leaf under 'Training-Time Environment Poisoning Attacks', where it appears as the sole paper in its category. This positioning suggests the paper addresses a relatively sparse research direction within the broader field of RL security, which encompasses 19 papers across multiple attack vectors including reward manipulation, federated learning poisoning, and multi-agent scenarios.
The taxonomy reveals that neighboring research directions focus on reward and transition manipulation (6 papers across white-box and black-box variants), implicit poisoning via agent interaction (1 paper), and supply-chain backdoor attacks (1 paper). The paper's emphasis on simulator dynamics distinguishes it from sibling categories that assume direct reward access or policy-level interventions. The scope note for its leaf explicitly excludes 'reward-based poisoning and supply-chain attacks', positioning this work at the intersection of environment manipulation and stealthy backdoor implantation without traditional reward-level control assumptions.
Across the three contributions, 26 candidate papers were examined, and the analysis reveals mixed novelty signals. For the novel threat model, 9 candidates were examined and 1 appears to describe overlapping prior work, suggesting some existing exploration of simulator-based attacks. For the Daze attack method, 7 candidates were examined and none clearly refutes it, indicating potential technical novelty in the specific approach. For the real-hardware transfer claim, 10 candidates were examined and 1 is a refutable match, suggesting that prior demonstrations of sim-to-real backdoor transfer exist. These counts reflect a limited search scope rather than exhaustive coverage of the literature.
Based on the top-26 semantic matches examined, the work appears to occupy a relatively underexplored niche within RL security, particularly regarding simulator-level manipulation without reward observation. However, the limited search scope and presence of refutable candidates for two of three contributions suggest caution in assessing absolute novelty. The taxonomy structure indicates this is part of a growing but still sparse research area, with the paper's unique positioning as the sole member of its leaf potentially reflecting either genuine novelty or incomplete taxonomy coverage.
Taxonomy
Research Landscape Overview
Claimed Contributions
The authors propose a new threat model where adversarial developers release malicious simulators that implant backdoors into RL agents during training. Unlike traditional backdoor attacks, this threat model assumes the adversary cannot alter or observe agent rewards, only manipulate state information and transition dynamics.
The authors design Daze, a reward-free backdoor attack that punishes agents for ignoring target actions by forcing random transitions in dazed states. They provide formal proofs that policies optimal in the adversarial MDP also optimize attack success and stealth objectives without requiring reward access.
The authors demonstrate backdoor attacks successfully triggering harmful behavior on physical robots (Turtlebot and Fetch platforms) in custom environments, representing the first known instance of RL backdoor attacks operating on real hardware rather than only in simulation.
Core Task Comparisons
Comparisons with papers in the same taxonomy category
Contribution Analysis
Detailed comparisons for each claimed contribution
Novel threat model targeting malicious simulators in RL
The authors propose a new threat model where adversarial developers release malicious simulators that implant backdoors into RL agents during training. Unlike traditional backdoor attacks, this threat model assumes the adversary cannot alter or observe agent rewards, only manipulate state information and transition dynamics.
[5] Policy teaching via environment poisoning: Training-time adversarial attacks against reinforcement learning
[28] Adversarial reinforcement learning based data poisoning attacks defense for task-oriented multi-user semantic communication
[29] Adversarial Machine Learning in Cybersecurity: A Review on Defending Against AI-Driven Attacks
[30] Learning to attack federated learning: A model-based reinforcement learning attack framework
[31] Adversarial policy training against deep reinforcement learning
[32] Repetitive Backdoor Attacks and Countermeasures for Smart Grid Reinforcement Incremental Learning
[33] Efficient Reward Poisoning Attacks on Online Deep Reinforcement Learning
[34] Watch your back: Backdoor attacks in deep reinforcement learning-based autonomous vehicle control systems
[35] Stealthy Backdoor Attack with Adversarial Training
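The threat model above can be illustrated with a minimal, hypothetical simulator wrapper: the adversary controls only the state channel and transition dynamics, while the reward is computed by the benign environment and forwarded untouched. This is a sketch under stated assumptions, not the paper's implementation; the names `ChainEnv`, `MaliciousSimulator`, and the +100 trigger stamp are illustrative inventions.

```python
class ChainEnv:
    """Toy benign environment: a 5-state chain; reward 1.0 on reaching state 4."""
    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        self.state = min(self.state + action, 4)
        return self.state, float(self.state == 4), self.state == 4


class MaliciousSimulator:
    """Sketch of the threat model: the adversary tampers only with the
    state channel; rewards pass through unobserved and unmodified."""

    def __init__(self, base_env, trigger_fn):
        self.base_env = base_env      # benign dynamics being impersonated
        self.trigger_fn = trigger_fn  # adversary-chosen trigger predicate

    def reset(self):
        return self._maybe_stamp(self.base_env.reset())

    def step(self, action):
        # The benign environment computes the reward; the wrapper forwards
        # it as-is, matching the "cannot alter or observe rewards" assumption.
        state, reward, done = self.base_env.step(action)
        return self._maybe_stamp(state), reward, done

    def _maybe_stamp(self, state):
        # Illustrative trigger stamp: shift the state encoding by 100.
        return state + 100 if self.trigger_fn(state) else state
```

The key property to notice is that the wrapper's reward path is a pure passthrough: everything the adversary does happens on the observation side.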
Daze attack with theoretical guarantees
The authors design Daze, a reward-free backdoor attack that punishes agents for ignoring target actions by forcing random transitions in dazed states. They provide formal proofs that policies optimal in the adversarial MDP also optimize attack success and stealth objectives without requiring reward access.
[20] Sleepernets: Universal backdoor poisoning attacks against reinforcement learning agents
[21] Stop-and-go: Exploring backdoor attacks on deep reinforcement learning-based traffic congestion control systems
[22] Pilot Backdoor Attack against Deep Reinforcement Learning Empowered Intelligent Reflection Surface for Smart Radio
[23] BadVim: Unveiling Backdoor Threats in Visual State Space Model
[24] Provable defense against backdoor policies in reinforcement learning
[25] Temporal Logic-Based Multi-Vehicle Backdoor Attacks against Offline RL Agents in End-to-end Autonomous Driving
[26] BACKDOORL: Backdoor Attack against Competitive Reinforcement Learning
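As described, Daze punishes agents for ignoring the target action without touching rewards: once triggered, the simulator returns random, uninformative transitions for a while. A minimal sketch of that mechanism follows; the class name, the daze duration, and the toy `ChainEnv` are hypothetical stand-ins, not the authors' code.

```python
import random


class ChainEnv:
    """Toy benign environment: a 5-state chain; reward 1.0 on reaching state 4."""
    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        self.state = min(self.state + action, 4)
        return self.state, float(self.state == 4), self.state == 4


class DazeStyleWrapper:
    """Sketch of the Daze idea: if the trigger is present and the agent does
    NOT take the target action, return random transitions for a few steps
    ('dazing' the agent). No reward is read or rewritten to enforce this."""

    def __init__(self, base_env, n_states, target_action, trigger_fn,
                 daze_steps=3, seed=0):
        self.base_env = base_env
        self.n_states = n_states
        self.target_action = target_action
        self.trigger_fn = trigger_fn
        self.daze_steps = daze_steps
        self.rng = random.Random(seed)
        self.dazed_for = 0

    def reset(self):
        self.dazed_for = 0
        self.last_obs = self.base_env.reset()
        return self.last_obs

    def step(self, action):
        if self.dazed_for > 0:
            # Dazed mode: ignore the action and emit a random state with no
            # progress, making non-compliance unrewarding on average.
            self.dazed_for -= 1
            self.last_obs = self.rng.randrange(self.n_states)
            return self.last_obs, 0.0, False
        triggered = self.trigger_fn(self.last_obs)
        state, reward, done = self.base_env.step(action)
        if triggered and action != self.target_action:
            self.dazed_for = self.daze_steps  # punish ignoring the trigger
        self.last_obs = state
        return state, reward, done
```

Because compliance in triggered states avoids the wasted dazed steps, an agent maximizing return in this adversarial MDP is pushed toward the target action, which is the intuition behind the paper's claim that the attack objective is optimized without reward access.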
First demonstration of RL backdoor transfer to real robotic hardware
The authors demonstrate backdoor attacks successfully triggering harmful behavior on physical robots (Turtlebot and Fetch platforms) in custom environments, representing the first known instance of RL backdoor attacks operating on real hardware rather than only in simulation.