Frozen Policy Iteration: Computationally Efficient RL under Linear Q^π Realizability for Deterministic Dynamics

ICLR 2026 Conference Submission, Anonymous Authors
Keywords: Theory of Reinforcement Learning, Policy Iteration, Linear Function Approximation
Abstract:

We study computationally and statistically efficient reinforcement learning under the linear Q^π realizability assumption, where every policy's Q-function is linear in a given state-action feature representation. Prior methods in this setting are either computationally intractable or require (local) access to a simulator. In this paper, we propose a computationally efficient online RL algorithm, named Frozen Policy Iteration, for the linear Q^π realizability setting that works for Markov Decision Processes (MDPs) with stochastic initial states, stochastic rewards, and deterministic transitions. Our algorithm achieves a regret bound of Õ(√(d²H⁶T)), where d is the dimensionality of the feature space, H is the horizon length, and T is the total number of episodes. This bound is optimal for linear (contextual) bandits, which are a special case of our setting with H = 1.
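For reference, the realizability assumption and the stated guarantee can be written out explicitly; this is a reconstruction from the abstract's notation, not a verbatim statement from the paper:

```latex
% Linear Q^\pi realizability: for every policy \pi and stage h, there is
% a parameter vector w_h^\pi \in \mathbb{R}^d such that
\[
  Q_h^{\pi}(s, a) \;=\; \bigl\langle \phi(s, a),\, w_h^{\pi} \bigr\rangle
  \qquad \text{for all } (s, a),
\]
% where \phi : \mathcal{S} \times \mathcal{A} \to \mathbb{R}^d is the
% given feature map. The regret guarantee stated in the abstract is
\[
  \mathrm{Regret}(T) \;=\; \widetilde{O}\!\left(\sqrt{d^{2} H^{6} T}\right),
\]
% which for H = 1 matches the known minimax rate for linear
% contextual bandits.
```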

Existing policy iteration algorithms under the same setting rely heavily on repeatedly sampling the same state through simulator access, which is not implementable in the online setting with stochastic initial states studied in this paper. In contrast, our new algorithm circumvents this limitation by strategically using only the high-confidence portion of the trajectory data and by freezing the policy at well-explored states, which ensures that all data used by our algorithm remains effectively on-policy throughout the course of learning. We further demonstrate the versatility of our approach by extending it to the Uniform-PAC setting and to function classes with bounded eluder dimension.
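The abstract's "freezing" idea rests on two standard ingredients that can be sketched concretely. The toy below is NOT the paper's Frozen Policy Iteration; it only illustrates, under invented names and thresholds, (1) regularized least-squares estimation of a linear Q^π parameter from trajectory data and (2) an elliptical confidence width that shrinks as a direction becomes well-explored, which is the kind of signal one could use to decide when to freeze the policy at a state.

```python
import numpy as np

# Toy sketch only -- not the authors' algorithm. All names (w_true,
# phi_query, the 0.2 threshold) are invented for illustration.

rng = np.random.default_rng(0)
d = 4                          # feature dimension
w_true = rng.normal(size=d)    # unknown linear Q^pi parameter

def confidence_width(phi, cov):
    """Elliptical width sqrt(phi^T Sigma^{-1} phi), standard in linear bandits."""
    return float(np.sqrt(phi @ np.linalg.solve(cov, phi)))

cov = np.eye(d)                # ridge-regularized design matrix Sigma
b = np.zeros(d)
phi_query = rng.normal(size=d)         # direction whose exploration we track
phi_query /= np.linalg.norm(phi_query)

widths = []
for _ in range(200):
    phi = rng.normal(size=d)               # feature of a visited state-action
    q = phi @ w_true + 0.1 * rng.normal()  # noisy on-policy Q^pi target
    cov += np.outer(phi, phi)              # accumulate the design matrix
    b += q * phi                           # accumulate regression targets
    widths.append(confidence_width(phi_query, cov))

w_hat = np.linalg.solve(cov, b)     # regularized least-squares estimate
frozen = widths[-1] < 0.2           # hypothetical "well-explored" test
```

Because the design matrix only grows, the width along any fixed direction is nonincreasing, so a once-frozen decision never needs to be revisited on statistical grounds; this monotonicity is what makes a freezing rule coherent in principle.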

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholar search engine). It analyzes the paper's tasks and contributions against retrieved prior work. While the system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes Frozen Policy Iteration for computationally efficient online reinforcement learning under linear Q-π realizability, where each policy's Q-function is linear in given features. According to the taxonomy, this work resides in the 'Linear Q-Pi Realizability' leaf, which contains only two papers total. This is a notably sparse research direction within the broader 'Core Linear Realizability Frameworks' branch, suggesting the paper addresses a less crowded problem setting compared to adjacent areas like linear Q-star realizability or reward-free exploration frameworks.

The taxonomy reveals that neighboring leaves include 'Linear Q-Star Realizability' (three papers assuming optimal Q-function linearity) and 'Reward-Free and Exploration Frameworks' (two papers on exploration without pre-specified rewards). The scope note for the paper's leaf explicitly excludes optimal Q-function linearity assumptions and nonlinear extensions, positioning this work as addressing policy-dependent realizability under deterministic or stochastic dynamics. The taxonomy narrative highlights contrasts between frozen versus adaptive policies and local versus global planning, indicating that the paper's frozen policy approach represents a distinct methodological choice within the linear realizability landscape.

Among the twenty-eight candidates examined via limited semantic search, the contribution-level analysis shows mixed novelty signals. For the core Frozen Policy Iteration algorithm, ten candidates were examined with zero refutations, suggesting no clear prior work on this specific algorithmic approach within the search scope. Similarly, ten candidates were examined for the regret bound contribution without refutation. However, for the extension to Uniform-PAC and bounded eluder dimension, eight candidates were examined and three refutable matches were found, indicating more substantial prior work on this aspect. These statistics reflect a targeted literature search rather than exhaustive coverage, so unexamined work may exist beyond the top-K semantic matches.

Given the sparse taxonomy leaf and limited search scope, the algorithmic and regret bound contributions appear relatively novel within the examined candidate set, while the Uniform-PAC extension shows clearer overlap with prior work. The analysis covers top-thirty semantic matches and does not claim exhaustive field coverage, so the novelty assessment remains provisional pending broader literature review.

Taxonomy

Core-task Taxonomy Papers: 45
Claimed Contributions: 3
Contribution Candidate Papers Compared: 28
Refutable Papers: 3

Research Landscape Overview

Core task: Computationally efficient reinforcement learning under linear Q-function realizability. The field organizes around several major branches that reflect different modeling assumptions and problem settings. Core Linear Realizability Frameworks establish foundational theory for settings where Q-functions admit linear representations, including works on linear Q-π realizability such as Frozen Policy Iteration[0] and Local Planning Linear[6], as well as broader linear MDP structures exemplified by Efficient Linear RL[1] and Reward-Free Linear RL[2]. Offline and Batch Reinforcement Learning addresses learning from fixed datasets, with contributions like Offline Linear RL Limits[3] and Linear Programming Offline RL[15] exploring fundamental limits and algorithmic approaches. Extensions Beyond Standard Linear MDPs broaden the scope to richer function classes and relaxed assumptions, while Specialized Problem Settings tackle constrained optimization (Constrained Linear RL[4]), partial observability (POMDP Linear Approximation[5]), and nonstationarity (Nonstationary Linear RL[13]). Multi-Agent and Game-Theoretic Settings, Algorithmic Techniques and Theoretical Analysis, Domain-Specific Applications, and Practical Implementation branches round out the taxonomy by addressing competitive environments, algorithmic innovations, real-world deployments, and computational methods.

A particularly active line of work focuses on balancing statistical efficiency with computational tractability, exploring how different realizability assumptions enable polynomial-time algorithms. Frozen Policy Iteration[0] sits within the Linear Q-π Realizability cluster, emphasizing policy evaluation under frozen policies, a setting that contrasts with the more general linear MDP frameworks pursued by Efficient Linear RL[1] and the reward-free exploration paradigm of Reward-Free Linear RL[2].
Compared to Local Planning Linear[6], which investigates local planning under linear structure, Frozen Policy Iteration[0] adopts a complementary perspective by fixing the policy during iteration. Meanwhile, offline methods like Offline Linear RL Limits[3] highlight fundamental barriers when learning from batch data, raising questions about coverage and distribution shift that differ from the online exploration challenges addressed by works in the core realizability branch. These contrasting emphases—frozen versus adaptive policies, online versus offline data, and global versus local planning—illustrate the diverse strategies researchers employ to achieve computational efficiency under linear Q-function assumptions.

Claimed Contributions

Frozen Policy Iteration algorithm for computationally efficient online RL

The authors introduce Frozen Policy Iteration (FPI), a computationally efficient algorithm for online reinforcement learning under linear Q^π realizability. Unlike prior methods requiring simulators or intractable optimization, FPI works in the standard online setting with stochastic initial states and deterministic dynamics by strategically using high-confidence trajectory data and freezing policies for well-explored states.

10 retrieved papers
Regret bound of Õ(√(d²H⁶T)) optimal for linear bandits

The algorithm achieves a regret bound of Õ(√(d²H⁶T)), where d is the feature dimension, H is the horizon length, and T is the number of episodes. This bound is optimal for the special case of linear contextual bandits (H = 1), demonstrating statistical efficiency alongside computational tractability.

10 retrieved papers
Extension to Uniform-PAC setting and bounded eluder dimension

The authors extend their Frozen Policy Iteration framework beyond the basic regret setting to achieve Uniform-PAC guarantees and to handle more general function classes characterized by bounded eluder dimension, demonstrating the broader applicability of their freezing-based approach.

8 retrieved papers
Can Refute

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution
