Minimax Optimal Adversarial Reinforcement Learning

ICLR 2026 Conference Submission, Anonymous Authors
Keywords: episodic MDPs; adversarial RL; minimax-optimal regret bound
Abstract:

Consider episodic Markov decision processes (MDPs) in which the transition kernel is chosen adversarially at each episode. Prior works have established regret upper bounds of $\widetilde{\mathcal{O}}(\sqrt{T} + C^P)$, where $T$ is the number of episodes and $C^P$ quantifies the degree of adversarial change in the transition dynamics. Under fully adversarial dynamics, $C^P$ can scale linearly with $T$, so this bound may degenerate to $\mathcal{O}(T)$, i.e., linear regret. This raises a fundamental question: can sublinear regret be achieved under fully adversarial transition kernels? We answer this question affirmatively. First, we show that the optimal policy for MDPs with adversarial transition kernels must be history-dependent. We then design the Adversarial Dynamics Follow-the-Regularized-Leader (AD-FTRL) algorithm and prove that it achieves a sublinear regret of $\mathcal{O}(\sqrt{(|\mathcal{S}||\mathcal{A}|)^K T})$, where $K$ is the horizon length, $|\mathcal{S}|$ is the number of states, and $|\mathcal{A}|$ is the number of actions. Such a regret cannot be achieved by simply treating the problem as a contextual bandit. We further construct a hard MDP instance and prove a matching lower bound on the regret, thereby demonstrating the minimax optimality of our algorithm.
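For concreteness, the regret notion implied by the abstract can be written as follows (notation assumed for illustration, not quoted from the paper):

```latex
% Regret against the best fixed history-dependent policy over T episodes,
% where V_t^{\pi} denotes the expected K-step return of policy \pi under
% the transition kernel P_t chosen by the adversary in episode t, and
% \pi_t is the policy the learner plays in episode t.
\mathrm{Reg}(T) \;=\;
  \max_{\pi \in \Pi_{\mathrm{hist}}} \sum_{t=1}^{T} V_t^{\pi}
  \;-\; \sum_{t=1}^{T} V_t^{\pi_t}
```

The comparator class $\Pi_{\mathrm{hist}}$ ranges over history-dependent policies, which matters here because the paper's first result says Markov policies are suboptimal in this setting.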

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholar search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While this system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper establishes sublinear regret bounds for episodic MDPs with fully adversarial transition kernels, introducing the AD-FTRL algorithm and proving history-dependent optimal policies are necessary. It resides in the 'Adversarial MDPs with Unknown Transitions' leaf, which contains only four papers total, indicating a relatively sparse research direction within the broader taxonomy of 50 papers. This leaf focuses specifically on regret analysis under adversarial losses and unknown dynamics, distinguishing it from known-transition settings and bandit formulations that populate neighboring branches.

The taxonomy reveals this work sits within 'Theoretical Foundations and Regret Analysis,' one of eight major branches addressing adversarial RL. Neighboring leaves include 'Adversarial Restless Multi-Armed Bandits' (bandit feedback without full MDP structure) and 'Maximum Entropy and Robust RL Theory' (formal robustness proofs via MaxEnt). The sibling papers in the same leaf examine dynamic regret bounds and sample complexity under adversarial losses, but the taxonomy's scope notes explicitly exclude known-transition settings, positioning this work at the intersection of unknown dynamics and adversarial control where theoretical guarantees remain challenging.

Among 30 candidates examined, the first contribution (characterizing history-dependent optimal policies) shows one refutable candidate from 10 examined, suggesting some prior theoretical characterization exists in the limited search scope. The second contribution (AD-FTRL algorithm) and third contribution (minimax optimal bounds with matching lower bound) each examined 10 candidates with zero refutations, indicating these algorithmic and optimality results appear more novel within the searched literature. The analysis explicitly notes this reflects top-K semantic search plus citation expansion, not exhaustive coverage, so additional related work may exist beyond the examined set.

Given the sparse four-paper leaf and limited overlap detected across 30 candidates, the work appears to advance a relatively underexplored theoretical direction. The history-dependent policy characterization shows modest prior overlap, while the algorithmic and optimality contributions exhibit stronger novelty signals within the examined scope. However, the small search scale and narrow leaf population mean this assessment captures local novelty rather than field-wide positioning, and broader literature may contain additional relevant precedents not surfaced by semantic search.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 1

Research Landscape Overview

Core task: reinforcement learning with adversarially chosen transition kernels. This field examines how agents can learn effective policies when the environment's dynamics are selected by an adversary rather than being fixed or benignly stochastic. The taxonomy reveals a rich structure spanning eight main branches. Theoretical Foundations and Regret Analysis investigates minimax optimality and regret bounds in adversarial MDPs, often under unknown transitions, as in Minimax Optimal Adversarial RL[0] and Dynamic Regret Adversarial MDPs[6]. Robustness Enhancement Methods focuses on training techniques that harden policies against perturbations, including adversarial regularization and distributionally robust approaches such as Distributionally Robust Policy[8]. Adversarial Attack Methods and Vulnerability Analysis explores how to craft effective attacks on trained agents, with studies like Robust DRL Adversarial Perturbations[1] and Adversarial Attacks Training Survey[3] characterizing threat models. Meanwhile, the Domain-Specific Applications and Specialized Environments branches demonstrate how adversarial RL principles apply to cybersecurity, autonomous driving, and other real-world testbeds.

Several active lines of work highlight contrasting emphases and open questions. One thread pursues tight regret guarantees for online learning in adversarial MDPs, balancing computational efficiency with statistical optimality; another emphasizes practical robustness via adversarial training or auxiliary models that anticipate worst-case perturbations.

Minimax Optimal Adversarial RL[0] sits squarely within the theoretical branch on adversarial MDPs with unknown transitions, aiming to establish minimax optimal rates. It shares conceptual ground with Dynamic Regret Adversarial MDPs[6], which also tackles regret minimization under adversarial dynamics, and with No-Regret Online RL[41], which explores no-regret guarantees in online settings. Compared to these neighbors, Minimax Optimal Adversarial RL[0] appears to emphasize achieving the tightest possible bounds in the unknown-transition regime, whereas Dynamic Regret Adversarial MDPs[6] may focus more on time-varying adversaries. This positioning underscores ongoing debates about the trade-offs among sample complexity, computational tractability, and the strength of adversarial assumptions.

Claimed Contributions

Characterization of optimal policy under adversarial transitions

The authors establish that when transition kernels are chosen adversarially at each episode, the optimal policy must depend on the full history of observations rather than only the current state. This contrasts with standard MDPs where Markov policies are known to be optimal.

10 retrieved papers (1 can refute)
AD-FTRL algorithm with sublinear regret guarantee

The authors design a Follow-the-Regularized-Leader algorithm that operates with bandit feedback and unknown adversarial transitions. The algorithm uses trajectory-level occupancy measures and importance sampling with a carefully designed regularization term to achieve sublinear regret without requiring knowledge of transition kernels.

10 retrieved papers (none can refute)
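The FTRL scheme described above can be sketched in miniature. The snippet below is an illustrative sketch, not the paper's actual algorithm: it runs entropy-regularized FTRL (exponential weights) over a small finite set of candidate policies, with inverse-propensity loss estimates standing in for the paper's trajectory-level occupancy measures and importance sampling. All function names and parameters here are assumptions made for illustration.

```python
import numpy as np

def ad_ftrl_exp_weights(loss_matrix, eta, rng=None):
    """Entropy-regularized FTRL (exponential weights) with inverse-propensity
    loss estimates under bandit feedback.

    loss_matrix: (T, N) array; row t holds the adversary's losses at episode t
    for each of N candidate policies. Only the sampled policy's loss is
    observed, mimicking bandit feedback.
    """
    rng = np.random.default_rng(0) if rng is None else rng
    T, N = loss_matrix.shape
    cum_est = np.zeros(N)       # cumulative importance-weighted loss estimates
    picks = []
    for t in range(T):
        # FTRL with a negative-entropy regularizer reduces to a softmax
        # of the negated, scaled cumulative loss estimates.
        logits = -eta * cum_est
        logits -= logits.max()  # numerical stability
        p = np.exp(logits)
        p /= p.sum()
        i = rng.choice(N, p=p)
        picks.append(int(i))
        # Bandit feedback: only loss_matrix[t, i] is observed. Dividing by
        # p[i] (inverse propensity) keeps the estimate unbiased.
        cum_est[i] += loss_matrix[t, i] / p[i]
    return picks, cum_est

# Demo: policy 0 always incurs loss 0, the other two always incur loss 1;
# the sampling distribution should concentrate on policy 0 over time.
T, N = 500, 3
losses = np.ones((T, N))
losses[:, 0] = 0.0
picks, cum_est = ad_ftrl_exp_weights(losses, eta=0.1)
```

With a negative-entropy regularizer, the FTRL update has the closed-form exponential-weights solution, which is why the sketch computes a softmax rather than solving an optimization problem each round.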
Minimax optimal regret bound with matching lower bound

The authors construct a hard MDP instance and prove a matching lower bound that demonstrates their algorithm achieves the minimax optimal regret. Their proof introduces a new analytical approach using composite hypothesis testing for handling adversarial transitions, providing a complete characterization of the fundamental difficulty of this problem.

10 retrieved papers (none can refute)
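What "minimax optimal" means for this contribution can be stated compactly (notation assumed for illustration): the lower bound shows that no algorithm can improve on the rate AD-FTRL attains.

```latex
% For every learning algorithm ALG there exists an adversarial sequence of
% transition kernels P_{1:T} forcing expected regret Reg(T) of at least a
% constant multiple of the upper bound, for some universal constant c > 0:
\inf_{\mathrm{ALG}} \; \sup_{P_{1:T}}
  \mathbb{E}\big[\mathrm{Reg}(T)\big]
  \;\geq\; c \,\sqrt{(|\mathcal{S}||\mathcal{A}|)^{K}\, T}
```

This matches the $\mathcal{O}(\sqrt{(|\mathcal{S}||\mathcal{A}|)^K T})$ upper bound up to constant factors, which is the sense in which the characterization is complete.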

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution 1: Characterization of optimal policy under adversarial transitions

Contribution 2: AD-FTRL algorithm with sublinear regret guarantee

Contribution 3: Minimax optimal regret bound with matching lower bound
