Minimax Optimal Adversarial Reinforcement Learning

ICLR 2026 Conference Submission, Anonymous Authors
Keywords: episodic MDPs; adversarial RL; minimax-optimal regret bound
Abstract:

Consider episodic Markov decision processes (MDPs) in which the transition kernel is chosen adversarially at each episode. Prior works have established regret upper bounds of $\widetilde{\mathcal{O}}(\sqrt{T} + C^P)$, where $T$ is the number of episodes and $C^P$ quantifies the degree of adversarial change in the transition dynamics. Under fully adversarial dynamics, $C^P$ can scale linearly with $T$, so this bound may degenerate to $\mathcal{O}(T)$, i.e., linear regret. This raises a fundamental question: can sublinear regret be achieved under fully adversarial transition kernels? We answer this question affirmatively. First, we show that the optimal policy for MDPs with adversarial transition kernels must be history-dependent. We then design the Adversarial Dynamics Follow-the-Regularized-Leader (AD-FTRL) algorithm and prove that it achieves a sublinear regret of $\mathcal{O}(\sqrt{(|\mathcal{S}||\mathcal{A}|)^K T})$, where $K$ is the horizon length, $|\mathcal{S}|$ is the number of states, and $|\mathcal{A}|$ is the number of actions. Such a regret cannot be achieved by simply treating the problem as a contextual bandit. We further construct a hard MDP instance and prove a matching lower bound on the regret, thereby demonstrating the minimax optimality of our algorithm.
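For concreteness, the regret notion implied by the abstract can be written as follows (notation assumed for illustration, not quoted from the paper):

```latex
% Regret against the best fixed history-dependent policy over T episodes,
% where V_t^{\pi} denotes the expected K-step return of policy \pi under
% the transition kernel P_t chosen by the adversary in episode t, and
% \pi_t is the policy the learner plays in episode t.
\mathrm{Reg}(T) \;=\;
  \max_{\pi \in \Pi_{\mathrm{hist}}} \sum_{t=1}^{T} V_t^{\pi}
  \;-\; \sum_{t=1}^{T} V_t^{\pi_t}
```

The comparator class $\Pi_{\mathrm{hist}}$ ranges over history-dependent policies, which matters here because the paper's first result says Markov policies are suboptimal in this setting.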

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholar search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While this system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper establishes sublinear regret bounds for episodic MDPs with fully adversarial transition kernels, introducing the AD-FTRL algorithm and proving history-dependent optimal policies are necessary. It resides in the 'Adversarial MDPs with Unknown Transitions' leaf, which contains only four papers total, indicating a relatively sparse research direction within the broader taxonomy of 50 papers. This leaf focuses specifically on regret analysis under adversarial losses and unknown dynamics, distinguishing it from known-transition settings and bandit formulations that populate neighboring branches.

The taxonomy reveals this work sits within 'Theoretical Foundations and Regret Analysis,' one of eight major branches addressing adversarial RL. Neighboring leaves include 'Adversarial Restless Multi-Armed Bandits' (bandit feedback without full MDP structure) and 'Maximum Entropy and Robust RL Theory' (formal robustness proofs via MaxEnt). The sibling papers in the same leaf examine dynamic regret bounds and sample complexity under adversarial losses, but the taxonomy's scope notes explicitly exclude known-transition settings, positioning this work at the intersection of unknown dynamics and adversarial control where theoretical guarantees remain challenging.

Among 30 candidates examined, the first contribution (characterizing history-dependent optimal policies) shows one refutable candidate from 10 examined, suggesting some prior theoretical characterization exists in the limited search scope. The second contribution (AD-FTRL algorithm) and third contribution (minimax optimal bounds with matching lower bound) each examined 10 candidates with zero refutations, indicating these algorithmic and optimality results appear more novel within the searched literature. The analysis explicitly notes this reflects top-K semantic search plus citation expansion, not exhaustive coverage, so additional related work may exist beyond the examined set.

Given the sparse four-paper leaf and limited overlap detected across 30 candidates, the work appears to advance a relatively underexplored theoretical direction. The history-dependent policy characterization shows modest prior overlap, while the algorithmic and optimality contributions exhibit stronger novelty signals within the examined scope. However, the small search scale and narrow leaf population mean this assessment captures local novelty rather than field-wide positioning, and broader literature may contain additional relevant precedents not surfaced by semantic search.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 1

Research Landscape Overview

Core task: reinforcement learning with adversarially chosen transition kernels. This field examines how agents can learn effective policies when the environment's dynamics are selected by an adversary rather than being fixed or benignly stochastic. The taxonomy reveals a rich structure spanning eight main branches. Theoretical Foundations and Regret Analysis investigates minimax optimality and regret bounds in adversarial MDPs, often under unknown transitions, as in Minimax Optimal Adversarial RL[0] and Dynamic Regret Adversarial MDPs[6]. Robustness Enhancement Methods focuses on training techniques that harden policies against perturbations, including adversarial regularization and distributionally robust approaches such as Distributionally Robust Policy[8]. Adversarial Attack Methods and Vulnerability Analysis explores how to craft effective attacks on trained agents, with studies like Robust DRL Adversarial Perturbations[1] and Adversarial Attacks Training Survey[3] characterizing threat models. Meanwhile, the Domain-Specific Applications and Specialized Environments branches demonstrate how adversarial RL principles apply to cybersecurity, autonomous driving, and other real-world testbeds.

Several active lines of work highlight contrasting emphases and open questions. One thread pursues tight regret guarantees for online learning in adversarial MDPs, balancing computational efficiency with statistical optimality; another emphasizes practical robustness via adversarial training or auxiliary models that anticipate worst-case perturbations.

Minimax Optimal Adversarial RL[0] sits squarely within the theoretical branch on adversarial MDPs with unknown transitions, aiming to establish minimax optimal rates. It shares conceptual ground with Dynamic Regret Adversarial MDPs[6], which also tackles regret minimization under adversarial dynamics, and with No-Regret Online RL[41], which explores no-regret guarantees in online settings. Compared to these neighbors, Minimax Optimal Adversarial RL[0] appears to emphasize achieving the tightest possible bounds in the unknown-transition regime, whereas Dynamic Regret Adversarial MDPs[6] may focus more on time-varying adversaries. This positioning underscores ongoing debates about the trade-offs among sample complexity, computational tractability, and the strength of adversarial assumptions.

Claimed Contributions

Characterization of optimal policy under adversarial transitions

The authors establish that when transition kernels are chosen adversarially at each episode, the optimal policy must depend on the full history of observations rather than only the current state. This contrasts with standard MDPs where Markov policies are known to be optimal.

10 retrieved papers (1 can refute)
AD-FTRL algorithm with sublinear regret guarantee

The authors design a Follow-the-Regularized-Leader algorithm that operates with bandit feedback and unknown adversarial transitions. The algorithm uses trajectory-level occupancy measures and importance sampling with a carefully designed regularization term to achieve sublinear regret without requiring knowledge of transition kernels.

10 retrieved papers (none can refute)
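The FTRL scheme described above can be sketched in miniature. The snippet below is an illustrative sketch, not the paper's actual algorithm: it runs entropy-regularized FTRL (exponential weights) over a small finite set of candidate policies, with inverse-propensity loss estimates standing in for the paper's trajectory-level occupancy measures and importance sampling. All function names and parameters here are assumptions made for illustration.

```python
import numpy as np

def ad_ftrl_exp_weights(loss_matrix, eta, rng=None):
    """Entropy-regularized FTRL (exponential weights) with inverse-propensity
    loss estimates under bandit feedback.

    loss_matrix: (T, N) array; row t holds the adversary's losses at episode t
    for each of N candidate policies. Only the sampled policy's loss is
    observed, mimicking bandit feedback.
    """
    rng = np.random.default_rng(0) if rng is None else rng
    T, N = loss_matrix.shape
    cum_est = np.zeros(N)       # cumulative importance-weighted loss estimates
    picks = []
    for t in range(T):
        # FTRL with a negative-entropy regularizer reduces to a softmax
        # of the negated, scaled cumulative loss estimates.
        logits = -eta * cum_est
        logits -= logits.max()  # numerical stability
        p = np.exp(logits)
        p /= p.sum()
        i = rng.choice(N, p=p)
        picks.append(int(i))
        # Bandit feedback: only loss_matrix[t, i] is observed. Dividing by
        # p[i] (inverse propensity) keeps the estimate unbiased.
        cum_est[i] += loss_matrix[t, i] / p[i]
    return picks, cum_est

# Demo: policy 0 always incurs loss 0, the other two always incur loss 1;
# the sampling distribution should concentrate on policy 0 over time.
T, N = 500, 3
losses = np.ones((T, N))
losses[:, 0] = 0.0
picks, cum_est = ad_ftrl_exp_weights(losses, eta=0.1)
```

With a negative-entropy regularizer, the FTRL update has the closed-form exponential-weights solution, which is why the sketch computes a softmax rather than solving an optimization problem each round.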
Minimax optimal regret bound with matching lower bound

The authors construct a hard MDP instance and prove a matching lower bound that demonstrates their algorithm achieves the minimax optimal regret. Their proof introduces a new analytical approach using composite hypothesis testing for handling adversarial transitions, providing a complete characterization of the fundamental difficulty of this problem.

10 retrieved papers (none can refute)
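What "minimax optimal" means for this contribution can be stated compactly (notation assumed for illustration): the lower bound shows that no algorithm can improve on the rate AD-FTRL attains.

```latex
% For every learning algorithm ALG there exists an adversarial sequence of
% transition kernels P_{1:T} forcing expected regret Reg(T) of at least a
% constant multiple of the upper bound, for some universal constant c > 0:
\inf_{\mathrm{ALG}} \; \sup_{P_{1:T}}
  \mathbb{E}\big[\mathrm{Reg}(T)\big]
  \;\geq\; c \,\sqrt{(|\mathcal{S}||\mathcal{A}|)^{K}\, T}
```

This matches the $\mathcal{O}(\sqrt{(|\mathcal{S}||\mathcal{A}|)^K T})$ upper bound up to constant factors, which is the sense in which the characterization is complete.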

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution 1: Characterization of optimal policy under adversarial transitions

Contribution 2: AD-FTRL algorithm with sublinear regret guarantee

Contribution 3: Minimax optimal regret bound with matching lower bound
