How to Lose Inherent Counterfactuality in Reinforcement Learning

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: counterfactuality, inherent skills, reinforcement learning
Abstract:

Learning in high-dimensional MDPs with complex state dynamics has become possible with recent progress in reinforcement learning research. At the same time, deep neural policies have been observed to be highly unstable with respect to minor variations in their state space, causing volatile and unpredictable behaviour. To alleviate these instabilities, a line of work has proposed coping with this problem by explicitly regularizing the temporal difference loss to ensure local ε-invariance in the state space. In this paper, we provide theoretical foundations on the impact of ε-local invariance training on deep neural policy manifolds. Our comprehensive theoretical and experimental analysis reveals that standard reinforcement learning inherently learns counterfactual values, while recent training techniques that explicitly enforce ε-local invariance cause policies to lose counterfactuality and, further, to learn misaligned and inconsistent values. In connection to this analysis, we further highlight that this line of training methods breaks the core intuition and the original biological inspiration of reinforcement learning, introducing an intrinsic gap between how natural intelligence understands and interacts with an environment and how AI agents trained via ε-local invariance methods do. The misalignment, inaccuracy, and loss of counterfactuality revealed in our paper further demonstrate the need to rethink how truly reliable and generalizable reinforcement learning policies are established.
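
To make the setup concrete, here is a minimal sketch, in PyTorch, of the kind of objective the abstract describes: a temporal-difference loss with an explicit regularizer that penalizes Q-value changes inside an ε-ball around each state. The function names, the uniform perturbation model, and the hyperparameters are illustrative assumptions, not the specific method analyzed in the paper.

```python
import torch
import torch.nn.functional as F

def td_loss_with_eps_invariance(q_net, target_net, batch,
                                eps=0.01, lam=1.0, gamma=0.99):
    """TD loss plus an explicit epsilon-local invariance regularizer (sketch)."""
    s, a, r, s_next, done = batch  # state, action, reward, next state, done flag

    # Standard temporal-difference term.
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        target = r + gamma * (1.0 - done) * target_net(s_next).max(dim=1).values
    td = F.smooth_l1_loss(q_sa, target)

    # Invariance term: Q-values inside the eps-ball around each state are
    # pushed toward the Q-values of the unperturbed state.
    noise = (torch.rand_like(s) * 2.0 - 1.0) * eps  # uniform in [-eps, eps]
    invariance = F.mse_loss(q_net(s + noise), q_net(s).detach())

    return td + lam * invariance
```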

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper investigates how ε-local invariance training affects deep reinforcement learning policies, focusing on counterfactual value learning and policy manifold geometry. Within the taxonomy, it occupies the 'Counterfactual Value Learning and Policy Manifolds' leaf under 'Theoretical Foundations and Counterfactual Analysis'. Notably, this leaf contains only the original paper itself, with no sibling papers identified. This positioning suggests the work addresses a relatively sparse research direction, examining theoretical properties that have received limited direct attention in the literature surveyed.

The taxonomy reveals a clear structural division: theoretical foundations examining counterfactual reasoning versus practical training methods emphasizing adversarial robustness. The original paper sits in the former branch, while the neighboring 'Adversarial Robustness Training' leaf (containing one paper on active adversarial training) represents the practical counterpart. The taxonomy's scope notes explicitly separate theoretical counterfactual analysis from adversarial training techniques, indicating these represent distinct but complementary research threads. The paper's focus on manifold geometry and value alignment positions it at the conceptual foundation of understanding ε-local constraints, rather than in the algorithmic development space.

Among twenty candidates examined across three contributions, none were identified as clearly refuting the paper's claims. The second contribution (inherent counterfactual reasoning in standard RL) examined ten candidates with zero refutable matches, as did the third contribution (counterfactuality-robustness trade-off). The first contribution (theoretical analysis of ε-local invariance effects) examined zero candidates. This pattern suggests that within the limited search scope of top-K semantic matches, the specific theoretical framing around counterfactual loss and policy manifolds appears relatively unexplored. However, the small candidate pool (twenty papers total) means the analysis covers a narrow slice of potentially relevant work.

Given the limited search scope and sparse taxonomy structure, the work appears to occupy a relatively novel theoretical niche within the examined literature. The absence of sibling papers and refutable candidates among twenty examined suggests the specific angle, connecting ε-local invariance to counterfactual value learning, has not been directly addressed in closely related work. However, this assessment is constrained by the top-K semantic search methodology and does not reflect an exhaustive survey of reinforcement learning robustness or theoretical RL literature more broadly.

Taxonomy

Core-task Taxonomy Papers: 1
Claimed Contributions: 3
Contribution Candidate Papers Compared: 20
Refutable Papers: 0

Research Landscape Overview

Core task: the impact of ε-local invariance training on reinforcement learning policies. The field structure suggested by this taxonomy divides into two main branches. The first, Theoretical Foundations and Counterfactual Analysis, examines the conceptual underpinnings of how policies represent and leverage counterfactual reasoning: essentially, understanding what alternative actions might have yielded and how ε-local constraints shape the manifold of learned policies. The second branch, Training Methods and Robustness, focuses on practical algorithms and techniques for building policies that remain stable under small perturbations, often through adversarial or regularization-based approaches. Together, these branches capture both the 'why' and the 'how' of ε-local invariance: one side investigates the theoretical implications for value learning and policy geometry, while the other develops concrete training recipes to achieve robustness.

A particularly active line of work explores the tension between enforcing local invariance and preserving the policy's ability to distinguish meaningful state differences. Losing Inherent Counterfactuality [0] sits within the Counterfactual Value Learning and Policy Manifolds cluster, emphasizing how ε-local training can inadvertently suppress the counterfactual signals that guide effective exploration and credit assignment. This contrasts with approaches like Active Adversarial Training [1], which prioritize robustness by explicitly injecting perturbations during learning, potentially at the cost of nuanced counterfactual reasoning.

The central open question is whether one can design training schemes that simultaneously maintain local invariance for robustness and retain the rich counterfactual structure needed for sample-efficient learning, or whether these goals inherently trade off against one another.
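
A toy numerical illustration of this tension, under the simplifying assumption that invariance is enforced exactly: when two states closer than ε have different optimal actions, a Q-function that is constant on the ε-ball must choose the same greedy action in both, and is therefore wrong in at least one. All numbers below are hypothetical.

```python
import numpy as np

# Toy example (hypothetical numbers): two states 0.05 apart, inside a
# shared eps-ball (eps = 0.1), whose true optimal actions differ.
true_q = {0.50: [1.0, 0.0],   # in state 0.50, action 0 is optimal
          0.55: [0.0, 1.0]}   # in state 0.55, action 1 is optimal

# An exactly eps-invariant Q-function assigns one value vector to the whole
# ball, hence one greedy action, and so must err in one of the two states.
invariant_q = [0.5, 0.5]
greedy = int(np.argmax(invariant_q))

for s, q in true_q.items():
    print(f"state {s}: optimal action {int(np.argmax(q))}, "
          f"invariant policy picks {greedy}")
```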

Claimed Contributions

Theoretical analysis of ε-local invariance training effects on Q-functions

The authors present a formal theoretical framework demonstrating that ε-local invariance training fundamentally alters learned value judgments in reinforcement learning. They prove an inherent trade-off between accurate Q-value estimation and robustness, showing that ε-invariant Q-functions overestimate optimal values and misalign counterfactual action rankings.

0 retrieved papers
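
One simple way to quantify the misaligned counterfactual action rankings this contribution refers to is pairwise ranking agreement between the action orderings induced by two Q-functions. The metric and the numbers below are our illustrative choices, not necessarily the paper's.

```python
from itertools import combinations

def pairwise_ranking_agreement(q_a, q_b):
    """Fraction of action pairs that two Q-value vectors order the same way.

    1.0 means identical counterfactual action rankings; lower values mean
    the two value functions disagree about which alternatives were better.
    """
    assert len(q_a) == len(q_b)
    pairs = list(combinations(range(len(q_a)), 2))
    agree = sum(
        ((q_a[i] - q_a[j]) * (q_b[i] - q_b[j])) > 0
        for i, j in pairs
    )
    return agree / len(pairs)

# Hypothetical Q-values over four actions in one state:
q_standard  = [2.0, 1.5, 0.9, 0.1]   # standard RL value estimates
q_invariant = [1.4, 1.6, 1.5, 0.2]   # after eps-invariance training
print(pairwise_ranking_agreement(q_standard, q_invariant))  # 0.666...
```
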
Discovery that standard RL possesses inherent counterfactual reasoning ability

The authors establish that standard reinforcement learning naturally learns counterfactual values aligned with human decision-making processes, while ε-invariance training methods cause policies to lose this inherent counterfactual ability, resulting in inaccurate, inconsistent, and misaligned value functions.

10 retrieved papers
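
The sense in which a learned Q-function is inherently counterfactual can be seen in a plain tabular sketch: after learning, Q(s, a) estimates the return the agent would have obtained had it taken action a, including actions it did not take at that step. The two-armed bandit below is a made-up minimal example.

```python
import random

# Minimal tabular sketch (hypothetical bandit): although the agent takes
# only one action per step, the learned Q-values estimate the return of
# every action, i.e. what would have happened under the alternatives.
random.seed(0)
true_mean = {0: 1.0, 1: 0.3}   # action 0 is better on average
q = {0: 0.0, 1: 0.0}
alpha = 0.1

for _ in range(2000):
    a = random.choice([0, 1])             # uniform exploration
    r = random.gauss(true_mean[a], 0.1)   # noisy reward
    q[a] += alpha * (r - q[a])            # tabular Q-learning update

# q now answers the counterfactual query "what if I had pulled arm a?"
print({a: round(v, 2) for a, v in q.items()})  # roughly {0: 1.0, 1: 0.3}
```
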
Identification of fundamental trade-off between counterfactuality and robustness

The authors formalize and demonstrate, through theory and experiments, a fundamental trade-off showing that certified ε-invariance training sacrifices the inherent counterfactual reasoning capabilities of standard RL in pursuit of robustness guarantees, and they reveal the core mechanisms behind this phenomenon.

10 retrieved papers
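
As a back-of-the-envelope argument (our own hedged illustration, not the paper's theorem), one can see why some trade-off of this form is unavoidable when invariance is enforced exactly:

```latex
\paragraph{Hedged illustration (not the paper's theorem).}
Assume $\hat{Q}$ is exactly $\epsilon$-invariant:
$\hat{Q}(s_1,a) = \hat{Q}(s_2,a)$ whenever $\|s_1 - s_2\| \le \epsilon$.
If the true optimal values vary within one $\epsilon$-ball,
\[
  \delta \;=\; Q^*(s_1,a) - Q^*(s_2,a) \;>\; 0,
  \qquad \|s_1 - s_2\| \le \epsilon,
\]
then, since $\hat{Q}(s_1,a) = \hat{Q}(s_2,a)$, the triangle inequality gives
\[
  \bigl|\hat{Q}(s_1,a) - Q^*(s_1,a)\bigr|
  + \bigl|\hat{Q}(s_2,a) - Q^*(s_2,a)\bigr| \;\ge\; \delta ,
\]
so the worst-case estimation error is bounded below:
$\sup_{s,a} |\hat{Q}(s,a) - Q^*(s,a)| \ge \delta/2$.
Exact invariance and exact value accuracy therefore cannot coexist
whenever $Q^*$ genuinely varies at scale $\epsilon$.
```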

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Within the taxonomy built over the current top-K core-task papers, the original paper is assigned to a leaf with no direct siblings and no cousin branches under the same grandparent topic. In this retrieved landscape it appears structurally isolated, which is a partial signal of novelty, though one still constrained by search coverage and taxonomy granularity.

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution 1: Theoretical analysis of ε-local invariance training effects on Q-functions

The authors present a formal theoretical framework demonstrating that ε-local invariance training fundamentally alters learned value judgments in reinforcement learning. They prove an inherent trade-off between accurate Q-value estimation and robustness, showing that ε-invariant Q-functions overestimate optimal values and misalign counterfactual action rankings.

No candidate papers were retrieved for this contribution, so no direct comparisons were performed.

Contribution 2: Discovery that standard RL possesses inherent counterfactual reasoning ability

The authors establish that standard reinforcement learning naturally learns counterfactual values aligned with human decision-making processes, while ε-invariance training methods cause policies to lose this inherent counterfactual ability, resulting in inaccurate, inconsistent, and misaligned value functions.

Ten candidate papers were retrieved; none were judged to refute this contribution.

Contribution 3: Identification of fundamental trade-off between counterfactuality and robustness

The authors formalize and demonstrate, through theory and experiments, a fundamental trade-off showing that certified ε-invariance training sacrifices the inherent counterfactual reasoning capabilities of standard RL in pursuit of robustness guarantees, and they reveal the core mechanisms behind this phenomenon.

Ten candidate papers were retrieved; none were judged to refute this contribution.