Beware Untrusted Simulators -- Reward-Free Backdoor Attacks in Reinforcement Learning

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: Poisoning Attacks, Backdoor Attacks, Reinforcement Learning, Deep Reinforcement Learning, Robotics
Abstract:

Simulated environments are a key component in the success of Reinforcement Learning (RL), allowing practitioners and researchers to train decision-making agents without running expensive experiments on real hardware. Simulators remain a security blind spot, however, enabling adversarial developers to alter the dynamics of their released simulators for malicious purposes. In this work we highlight a novel threat, demonstrating how simulator dynamics can be exploited to stealthily implant action-level backdoors into RL agents. The backdoor then allows an adversary to reliably activate targeted actions in an agent upon observing a predefined "trigger", leading to potentially dangerous consequences. Traditional backdoor attacks are limited by their strong threat models, which assume the adversary has near-full control over an agent's training pipeline and can both alter and observe the agent's rewards. As these assumptions are infeasible to implement within a simulator, we propose a new attack, "Daze", which reliably and stealthily implants backdoors into RL agents trained for real-world tasks without altering or even observing their rewards. We provide a formal proof of Daze's effectiveness in guaranteeing attack success across general RL tasks, along with extensive empirical evaluations on both discrete and continuous action-space domains. We additionally provide the first example of RL backdoor attacks transferring to real robotic hardware. These developments motivate further research into securing all components of the RL training pipeline against malicious attacks.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholar search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While this system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces a backdoor attack targeting RL agents through malicious simulator dynamics, proposing the 'Daze' method that operates without altering or observing agent rewards. According to the taxonomy, this work occupies the 'Simulator-Based Backdoor Implantation' leaf under 'Training-Time Environment Poisoning Attacks', where it appears as the sole paper in its category. This positioning suggests the paper addresses a relatively sparse research direction within the broader field of RL security, which encompasses 19 papers across multiple attack vectors including reward manipulation, federated learning poisoning, and multi-agent scenarios.

The taxonomy reveals that neighboring research directions focus on reward and transition manipulation (6 papers across white-box and black-box variants), implicit poisoning via agent interaction (1 paper), and supply-chain backdoor attacks (1 paper). The paper's emphasis on simulator dynamics distinguishes it from sibling categories that assume direct reward access or policy-level interventions. The scope note for its leaf explicitly excludes 'reward-based poisoning and supply-chain attacks', positioning this work at the intersection of environment manipulation and stealthy backdoor implantation without traditional reward-level control assumptions.

Among 26 candidates examined across three contributions, the analysis reveals mixed novelty signals. The novel threat model contribution examined 9 candidates with 1 appearing to provide overlapping prior work, suggesting some existing exploration of simulator-based attacks. The Daze attack method examined 7 candidates with none clearly refuting it, indicating potential technical novelty in the specific approach. The real hardware transfer claim examined 10 candidates with 1 refutable match, suggesting prior demonstrations of sim-to-real backdoor transfer exist. These statistics reflect a limited search scope rather than exhaustive coverage of the literature.

Based on the top-26 semantic matches examined, the work appears to occupy a relatively underexplored niche within RL security, particularly regarding simulator-level manipulation without reward observation. However, the limited search scope and presence of refutable candidates for two of three contributions suggest caution in assessing absolute novelty. The taxonomy structure indicates this is part of a growing but still sparse research area, with the paper's unique positioning as the sole member of its leaf potentially reflecting either genuine novelty or incomplete taxonomy coverage.

Taxonomy

Core-task Taxonomy Papers: 19
Claimed Contributions: 3
Contribution Candidate Papers Compared: 26
Refutable Papers: 2

Research Landscape Overview

Core task: Backdoor attacks in reinforcement learning through malicious simulators.

The field structure reflects a broad concern with adversarial manipulation of RL training pipelines, organized into several main branches. Training-Time Environment Poisoning Attacks examine how adversaries can corrupt the simulator or environment dynamics to embed hidden triggers, causing agents to misbehave under specific conditions while appearing normal otherwise. Federated Learning Poisoning Attacks address distributed settings where multiple parties contribute to training, enabling attackers to inject malicious updates that propagate across the federation. Multi-Agent and Population-Level Attacks explore scenarios involving interactions among multiple learners, where poisoning one agent can cascade through a population. Defense and Resilience Mechanisms investigate detection methods, robust training procedures, and architectural safeguards to mitigate these threats. Simulation and Testbed Environments provide controlled platforms for evaluating attack and defense strategies, while Related Attack Domains connect RL poisoning to broader adversarial machine learning contexts such as supervised learning backdoors and recommendation system manipulation.

Several active lines of work highlight contrasting trade-offs between stealthiness, transferability, and the attacker's level of control. Some studies focus on black-box online poisoning where the adversary has limited access to the training process, as in Online Poisoning Black-box[1], while others assume stronger capabilities such as direct reward manipulation in federated settings, exemplified by Reward Poisoning Federated[2]. Meta-learning contexts introduce additional complexity, with Meta-RL Poisoning[3] showing how backdoors can persist across task distributions.
The original paper, Untrusted Simulators[0], sits within the simulator-based backdoor implantation cluster, emphasizing scenarios where the environment itself is compromised rather than relying solely on reward or policy-level interventions. This contrasts with works like Policy Teaching Poisoning[5] and TrojanForge[6], which manipulate demonstrations or model parameters directly, underscoring an open question about which attack surface (simulator dynamics, reward signals, or policy updates) offers adversaries the most leverage while remaining hardest to detect.

Claimed Contributions

Novel threat model targeting malicious simulators in RL

The authors propose a new threat model where adversarial developers release malicious simulators that implant backdoors into RL agents during training. Unlike traditional backdoor attacks, this threat model assumes the adversary cannot alter or observe agent rewards, only manipulate state information and transition dynamics.

9 retrieved papers (Can Refute)
Daze attack with theoretical guarantees

The authors design Daze, a reward-free backdoor attack that punishes agents for ignoring target actions by forcing random transitions in dazed states. They provide formal proofs that policies optimal in the adversarial MDP also optimize attack success and stealth objectives without requiring reward access.
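The dazing mechanism described above can be sketched as a thin wrapper over otherwise benign dynamics. The toy chain environment, class names, trigger state, and target action below are illustrative assumptions, not the authors' implementation; the point of the sketch is that the malicious simulator touches only states and transitions, while the victim's reward function runs outside it and is never read or modified.

```python
import random


class ChainSim:
    """Toy deterministic chain dynamics (the part a simulator vendor ships).

    Note: the simulator returns only the next state; the victim computes
    rewards on its own, so the simulator has no reward access at all.
    """
    N = 5  # states 0..4

    def __init__(self):
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # action 0 = move left, 1 = move right (clipped at the ends)
        delta = 1 if action == 1 else -1
        self.state = max(0, min(self.N - 1, self.state + delta))
        return self.state


class DazeSim(ChainSim):
    """Hypothetical Daze-style malicious simulator (illustrative sketch).

    In a trigger state, any action other than the target is punished by a
    "daze": the transition is replaced with a jump to a uniformly random
    state, making non-target actions uninformative to the learner. Rewards
    are never read or written here.
    """
    TRIGGER = {2}  # assumed trigger state for this sketch
    TARGET = 1     # assumed backdoor target action

    def __init__(self, seed=0):
        super().__init__()
        self.rng = random.Random(seed)

    def step(self, action):
        if self.state in self.TRIGGER and action != self.TARGET:
            # Dazed: force a random transition instead of the benign one.
            self.state = self.rng.randrange(self.N)
            return self.state
        return super().step(action)  # benign dynamics otherwise


def victim_reward(state):
    """Victim-side reward, computed outside the (untrusted) simulator."""
    return 1.0 if state == ChainSim.N - 1 else 0.0
```

Under this sketch, an agent maximizing `victim_reward` learns that taking the target action in trigger states is the only way to make reliable progress, which is the intuition behind the paper's reward-free guarantee.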

7 retrieved papers
First demonstration of RL backdoor transfer to real robotic hardware

The authors demonstrate backdoor attacks successfully triggering harmful behavior on physical robots (Turtlebot and Fetch platforms) in custom environments, representing the first known instance of RL backdoor attacks operating on real hardware rather than only in simulation.

10 retrieved papers (Can Refute)

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Within the taxonomy built over the current TopK core-task papers, the original paper is assigned to a leaf with no direct siblings and no cousin branches under the same grandparent topic. In this retrieved landscape, it appears structurally isolated, which is one partial signal of novelty, but still constrained by search coverage and taxonomy granularity.

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Novel threat model targeting malicious simulators in RL

The authors propose a new threat model where adversarial developers release malicious simulators that implant backdoors into RL agents during training. Unlike traditional backdoor attacks, this threat model assumes the adversary cannot alter or observe agent rewards, only manipulate state information and transition dynamics.

Contribution

Daze attack with theoretical guarantees

The authors design Daze, a reward-free backdoor attack that punishes agents for ignoring target actions by forcing random transitions in dazed states. They provide formal proofs that policies optimal in the adversarial MDP also optimize attack success and stealth objectives without requiring reward access.

Contribution

First demonstration of RL backdoor transfer to real robotic hardware

The authors demonstrate backdoor attacks successfully triggering harmful behavior on physical robots (Turtlebot and Fetch platforms) in custom environments, representing the first known instance of RL backdoor attacks operating on real hardware rather than only in simulation.