Frozen Policy Iteration: Computationally Efficient RL under Linear Q^π Realizability for Deterministic Dynamics

ICLR 2026 Conference Submission, Anonymous Authors
Keywords: Theory of Reinforcement Learning, Policy Iteration, Linear Function Approximation
Abstract:

We study computationally and statistically efficient reinforcement learning under the linear Q^π realizability assumption, where every policy's Q-function is linear in a given state-action feature representation. Prior methods in this setting are either computationally intractable or require (local) access to a simulator. In this paper, we propose a computationally efficient online RL algorithm, named Frozen Policy Iteration, for the linear Q^π realizability setting that works for Markov Decision Processes (MDPs) with stochastic initial states, stochastic rewards, and deterministic transitions. Our algorithm achieves a regret bound of Õ(√(d²H⁶T)), where d is the dimensionality of the feature space, H is the horizon length, and T is the total number of episodes. This bound is optimal for linear (contextual) bandits, which are a special case of our setting with H = 1.
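For reference, the realizability assumption and the stated guarantee can be written out explicitly; this is a reconstruction from the abstract's notation, not a verbatim statement from the paper:

```latex
% Linear Q^\pi realizability: for every policy \pi and stage h, there is
% a parameter vector w_h^\pi \in \mathbb{R}^d such that
\[
  Q_h^{\pi}(s, a) \;=\; \bigl\langle \phi(s, a),\, w_h^{\pi} \bigr\rangle
  \qquad \text{for all } (s, a),
\]
% where \phi : \mathcal{S} \times \mathcal{A} \to \mathbb{R}^d is the
% given feature map. The regret guarantee stated in the abstract is
\[
  \mathrm{Regret}(T) \;=\; \widetilde{O}\!\left(\sqrt{d^{2} H^{6} T}\right),
\]
% which for H = 1 matches the known minimax rate for linear
% contextual bandits.
```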

Existing policy iteration algorithms under the same setting rely heavily on repeatedly sampling the same state through simulator access, which is not implementable in the online setting with stochastic initial states studied in this paper. In contrast, our new algorithm circumvents this limitation by strategically using only the high-confidence portion of the trajectory data and by freezing the policy at well-explored states, which ensures that all data used by our algorithm remains effectively on-policy throughout the course of learning. We further demonstrate the versatility of our approach by extending it to the Uniform-PAC setting and to function classes with bounded eluder dimension.
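The abstract's "freezing" idea rests on two standard ingredients that can be sketched concretely. The toy below is NOT the paper's Frozen Policy Iteration; it only illustrates, under invented names and thresholds, (1) regularized least-squares estimation of a linear Q^π parameter from trajectory data and (2) an elliptical confidence width that shrinks as a direction becomes well-explored, which is the kind of signal one could use to decide when to freeze the policy at a state.

```python
import numpy as np

# Toy sketch only -- not the authors' algorithm. All names (w_true,
# phi_query, the 0.2 threshold) are invented for illustration.

rng = np.random.default_rng(0)
d = 4                          # feature dimension
w_true = rng.normal(size=d)    # unknown linear Q^pi parameter

def confidence_width(phi, cov):
    """Elliptical width sqrt(phi^T Sigma^{-1} phi), standard in linear bandits."""
    return float(np.sqrt(phi @ np.linalg.solve(cov, phi)))

cov = np.eye(d)                # ridge-regularized design matrix Sigma
b = np.zeros(d)
phi_query = rng.normal(size=d)         # direction whose exploration we track
phi_query /= np.linalg.norm(phi_query)

widths = []
for _ in range(200):
    phi = rng.normal(size=d)               # feature of a visited state-action
    q = phi @ w_true + 0.1 * rng.normal()  # noisy on-policy Q^pi target
    cov += np.outer(phi, phi)              # accumulate the design matrix
    b += q * phi                           # accumulate regression targets
    widths.append(confidence_width(phi_query, cov))

w_hat = np.linalg.solve(cov, b)     # regularized least-squares estimate
frozen = widths[-1] < 0.2           # hypothetical "well-explored" test
```

Because the design matrix only grows, the width along any fixed direction is nonincreasing, so a once-frozen decision never needs to be revisited on statistical grounds; this monotonicity is what makes a freezing rule coherent in principle.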

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholar search engine). It analyzes the paper's tasks and contributions against retrieved prior work. While the system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes Frozen Policy Iteration for computationally efficient online reinforcement learning under linear Q-π realizability, where each policy's Q-function is linear in given features. According to the taxonomy, this work resides in the 'Linear Q-Pi Realizability' leaf, which contains only two papers total. This is a notably sparse research direction within the broader 'Core Linear Realizability Frameworks' branch, suggesting the paper addresses a less crowded problem setting compared to adjacent areas like linear Q-star realizability or reward-free exploration frameworks.

The taxonomy reveals that neighboring leaves include 'Linear Q-Star Realizability' (three papers assuming optimal Q-function linearity) and 'Reward-Free and Exploration Frameworks' (two papers on exploration without pre-specified rewards). The scope note for the paper's leaf explicitly excludes optimal Q-function linearity assumptions and nonlinear extensions, positioning this work as addressing policy-dependent realizability under deterministic or stochastic dynamics. The taxonomy narrative highlights contrasts between frozen versus adaptive policies and local versus global planning, indicating that the paper's frozen policy approach represents a distinct methodological choice within the linear realizability landscape.

Among the twenty-eight candidates examined via limited semantic search, the contribution-level analysis shows mixed novelty signals. For the core Frozen Policy Iteration algorithm, ten candidates were examined with zero refutations, suggesting no clear prior work on this specific algorithmic approach within the search scope. Similarly, ten candidates were examined for the regret bound contribution without refutation. However, for the extension to Uniform-PAC and bounded eluder dimension, eight candidates were examined and three refutable matches were found, indicating more substantial prior work on this aspect. These statistics reflect a targeted literature search rather than exhaustive coverage, so unexamined work may exist beyond the top-K semantic matches.

Given the sparse taxonomy leaf and limited search scope, the algorithmic and regret bound contributions appear relatively novel within the examined candidate set, while the Uniform-PAC extension shows clearer overlap with prior work. The analysis covers top-thirty semantic matches and does not claim exhaustive field coverage, so the novelty assessment remains provisional pending broader literature review.

Taxonomy

Core-task Taxonomy Papers: 45
Claimed Contributions: 3
Contribution Candidate Papers Compared: 28
Refutable Papers: 3

Research Landscape Overview

Core task: Computationally efficient reinforcement learning under linear Q-function realizability. The field organizes around several major branches that reflect different modeling assumptions and problem settings. Core Linear Realizability Frameworks establish foundational theory for settings where Q-functions admit linear representations, including works on linear Q-π realizability such as Frozen Policy Iteration[0] and Local Planning Linear[6], as well as broader linear MDP structures exemplified by Efficient Linear RL[1] and Reward-Free Linear RL[2]. Offline and Batch Reinforcement Learning addresses learning from fixed datasets, with contributions like Offline Linear RL Limits[3] and Linear Programming Offline RL[15] exploring fundamental limits and algorithmic approaches. Extensions Beyond Standard Linear MDPs broaden the scope to richer function classes and relaxed assumptions, while Specialized Problem Settings tackle constrained optimization (Constrained Linear RL[4]), partial observability (POMDP Linear Approximation[5]), and nonstationarity (Nonstationary Linear RL[13]). Multi-Agent and Game-Theoretic Settings, Algorithmic Techniques and Theoretical Analysis, Domain-Specific Applications, and Practical Implementation branches round out the taxonomy by addressing competitive environments, algorithmic innovations, real-world deployments, and computational methods.

A particularly active line of work focuses on balancing statistical efficiency with computational tractability, exploring how different realizability assumptions enable polynomial-time algorithms. Frozen Policy Iteration[0] sits within the Linear Q-π Realizability cluster, emphasizing policy evaluation under frozen policies, a setting that contrasts with the more general linear MDP frameworks pursued by Efficient Linear RL[1] and the reward-free exploration paradigm of Reward-Free Linear RL[2].
Compared to Local Planning Linear[6], which investigates local planning under linear structure, Frozen Policy Iteration[0] adopts a complementary perspective by fixing the policy during iteration. Meanwhile, offline methods like Offline Linear RL Limits[3] highlight fundamental barriers when learning from batch data, raising questions about coverage and distribution shift that differ from the online exploration challenges addressed by works in the core realizability branch. These contrasting emphases—frozen versus adaptive policies, online versus offline data, and global versus local planning—illustrate the diverse strategies researchers employ to achieve computational efficiency under linear Q-function assumptions.

Claimed Contributions

Frozen Policy Iteration algorithm for computationally efficient online RL

The authors introduce Frozen Policy Iteration (FPI), a computationally efficient algorithm for online reinforcement learning under linear Q^π realizability. Unlike prior methods requiring simulators or intractable optimization, FPI works in the standard online setting with stochastic initial states and deterministic dynamics by strategically using high-confidence trajectory data and freezing policies for well-explored states.

10 retrieved papers
Regret bound of Õ(√(d²H⁶T)) optimal for linear bandits

The algorithm achieves a regret bound of Õ(√(d²H⁶T)), where d is the feature dimension, H is the horizon length, and T is the number of episodes. This bound is optimal for the special case of linear contextual bandits (H = 1), demonstrating statistical efficiency alongside computational tractability.

10 retrieved papers
Extension to Uniform-PAC setting and bounded eluder dimension

The authors extend their Frozen Policy Iteration framework beyond the basic regret setting to achieve Uniform-PAC guarantees and to handle more general function classes characterized by bounded eluder dimension, demonstrating the broader applicability of their freezing-based approach.

8 retrieved papers
Can Refute

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution
