How to Lose Inherent Counterfactuality in Reinforcement Learning

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: counterfactuality, inherent skills, reinforcement learning
Abstract:

Learning in high-dimensional MDPs with complex state dynamics has become possible with recent progress in reinforcement learning research. At the same time, deep neural policies have been observed to be highly unstable with respect to minor variations in their state space, causing volatile and unpredictable behaviour. To alleviate these instabilities, a line of work has proposed coping with this problem by explicitly regularizing the temporal difference loss to ensure local ε-invariance in the state space. In this paper, we provide theoretical foundations on the impact of ε-local invariance training on deep neural policy manifolds. Our comprehensive theoretical and experimental analysis reveals that standard reinforcement learning inherently learns counterfactual values, while recent training techniques that explicitly enforce ε-local invariance cause policies to lose counterfactuality and, further, to learn misaligned and inconsistent values. In connection to this analysis, we further highlight that this line of training methods breaks the core intuition and the original biological inspiration of reinforcement learning, introducing an intrinsic gap between how natural intelligence understands and interacts with an environment and how AI agents trained via ε-local invariance methods do. The misalignment, inaccuracy, and loss of counterfactuality revealed in our paper further demonstrate the need to rethink how truly reliable and generalizable reinforcement learning policies are established.
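
To make the setup concrete, here is a minimal sketch, in PyTorch, of the kind of objective the abstract describes: a temporal-difference loss with an explicit regularizer that penalizes Q-value changes inside an ε-ball around each state. The function names, the uniform perturbation model, and the hyperparameters are illustrative assumptions, not the specific method analyzed in the paper.

```python
import torch
import torch.nn.functional as F

def td_loss_with_eps_invariance(q_net, target_net, batch,
                                eps=0.01, lam=1.0, gamma=0.99):
    """TD loss plus an explicit epsilon-local invariance regularizer (sketch)."""
    s, a, r, s_next, done = batch  # state, action, reward, next state, done flag

    # Standard temporal-difference term.
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        target = r + gamma * (1.0 - done) * target_net(s_next).max(dim=1).values
    td = F.smooth_l1_loss(q_sa, target)

    # Invariance term: Q-values inside the eps-ball around each state are
    # pushed toward the Q-values of the unperturbed state.
    noise = (torch.rand_like(s) * 2.0 - 1.0) * eps  # uniform in [-eps, eps]
    invariance = F.mse_loss(q_net(s + noise), q_net(s).detach())

    return td + lam * invariance
```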

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper investigates how ε-local invariance training affects deep reinforcement learning policies, focusing on counterfactual value learning and policy manifold geometry. Within the taxonomy, it occupies the 'Counterfactual Value Learning and Policy Manifolds' leaf under 'Theoretical Foundations and Counterfactual Analysis'. Notably, this leaf contains only the original paper itself, with no sibling papers identified. This positioning suggests the work addresses a relatively sparse research direction, examining theoretical properties that have received limited direct attention in the literature surveyed.

The taxonomy reveals a clear structural division: theoretical foundations examining counterfactual reasoning versus practical training methods emphasizing adversarial robustness. The original paper sits in the former branch, while the neighboring 'Adversarial Robustness Training' leaf (containing one paper on active adversarial training) represents the practical counterpart. The taxonomy's scope notes explicitly separate theoretical counterfactual analysis from adversarial training techniques, indicating these represent distinct but complementary research threads. The paper's focus on manifold geometry and value alignment positions it at the conceptual foundation of understanding ε-local constraints, rather than in the algorithmic development space.

Among twenty candidates examined across three contributions, none were identified as clearly refuting the paper's claims. The second contribution (inherent counterfactual reasoning in standard RL) examined ten candidates with zero refutable matches, as did the third contribution (counterfactuality-robustness trade-off). The first contribution (theoretical analysis of ε-local invariance effects) examined zero candidates. This pattern suggests that within the limited search scope of top-K semantic matches, the specific theoretical framing around counterfactual loss and policy manifolds appears relatively unexplored. However, the small candidate pool (twenty papers total) means the analysis covers a narrow slice of potentially relevant work.

Given the limited search scope and sparse taxonomy structure, the work appears to occupy a relatively novel theoretical niche within the examined literature. The absence of sibling papers and refutable candidates among twenty examined suggests the specific angle, connecting ε-local invariance to counterfactual value learning, has not been directly addressed in closely related work. However, this assessment is constrained by the top-K semantic search methodology and does not reflect an exhaustive survey of reinforcement learning robustness or theoretical RL literature more broadly.

Taxonomy

Core-task Taxonomy Papers: 1
Claimed Contributions: 3
Contribution Candidate Papers Compared: 20
Refutable Papers: 0

Research Landscape Overview

Core task: the impact of ε-local invariance training on reinforcement learning policies. The field structure suggested by this taxonomy divides into two main branches. The first, Theoretical Foundations and Counterfactual Analysis, examines the conceptual underpinnings of how policies represent and leverage counterfactual reasoning: essentially, understanding what alternative actions might have yielded and how ε-local constraints shape the manifold of learned policies. The second branch, Training Methods and Robustness, focuses on practical algorithms and techniques for building policies that remain stable under small perturbations, often through adversarial or regularization-based approaches. Together, these branches capture both the 'why' and the 'how' of ε-local invariance: one side investigates the theoretical implications for value learning and policy geometry, while the other develops concrete training recipes to achieve robustness.

A particularly active line of work explores the tension between enforcing local invariance and preserving the policy's ability to distinguish meaningful state differences. Losing Inherent Counterfactuality [0] sits within the Counterfactual Value Learning and Policy Manifolds cluster, emphasizing how ε-local training can inadvertently suppress the counterfactual signals that guide effective exploration and credit assignment. This contrasts with approaches like Active Adversarial Training [1], which prioritize robustness by explicitly injecting perturbations during learning, potentially at the cost of nuanced counterfactual reasoning.

The central open question is whether one can design training schemes that simultaneously maintain local invariance for robustness and retain the rich counterfactual structure needed for sample-efficient learning, or whether these goals inherently trade off against one another.
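
A toy numerical illustration of this tension, under the simplifying assumption that invariance is enforced exactly: when two states closer than ε have different optimal actions, a Q-function that is constant on the ε-ball must choose the same greedy action in both, and is therefore wrong in at least one. All numbers below are hypothetical.

```python
import numpy as np

# Toy example (hypothetical numbers): two states 0.05 apart, inside a
# shared eps-ball (eps = 0.1), whose true optimal actions differ.
true_q = {0.50: [1.0, 0.0],   # in state 0.50, action 0 is optimal
          0.55: [0.0, 1.0]}   # in state 0.55, action 1 is optimal

# An exactly eps-invariant Q-function assigns one value vector to the whole
# ball, hence one greedy action, and so must err in one of the two states.
invariant_q = [0.5, 0.5]
greedy = int(np.argmax(invariant_q))

for s, q in true_q.items():
    print(f"state {s}: optimal action {int(np.argmax(q))}, "
          f"invariant policy picks {greedy}")
```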

Claimed Contributions

Theoretical analysis of ε-local invariance training effects on Q-functions

The authors present a formal theoretical framework demonstrating that ε-local invariance training fundamentally alters learned value judgments in reinforcement learning. They prove an inherent trade-off between accurate Q-value estimation and robustness, showing that ε-invariant Q-functions overestimate optimal values and misalign counterfactual action rankings.

0 retrieved papers
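
One simple way to quantify the misaligned counterfactual action rankings this contribution refers to is pairwise ranking agreement between the action orderings induced by two Q-functions. The metric and the numbers below are our illustrative choices, not necessarily the paper's.

```python
from itertools import combinations

def pairwise_ranking_agreement(q_a, q_b):
    """Fraction of action pairs that two Q-value vectors order the same way.

    1.0 means identical counterfactual action rankings; lower values mean
    the two value functions disagree about which alternatives were better.
    """
    assert len(q_a) == len(q_b)
    pairs = list(combinations(range(len(q_a)), 2))
    agree = sum(
        ((q_a[i] - q_a[j]) * (q_b[i] - q_b[j])) > 0
        for i, j in pairs
    )
    return agree / len(pairs)

# Hypothetical Q-values over four actions in one state:
q_standard  = [2.0, 1.5, 0.9, 0.1]   # standard RL value estimates
q_invariant = [1.4, 1.6, 1.5, 0.2]   # after eps-invariance training
print(pairwise_ranking_agreement(q_standard, q_invariant))  # 0.666...
```
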
Discovery that standard RL possesses inherent counterfactual reasoning ability

The authors establish that standard reinforcement learning naturally learns counterfactual values aligned with human decision-making processes, while ε-invariance training methods cause policies to lose this inherent counterfactual ability, resulting in inaccurate, inconsistent, and misaligned value functions.

10 retrieved papers
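
The sense in which a learned Q-function is inherently counterfactual can be seen in a plain tabular sketch: after learning, Q(s, a) estimates the return the agent would have obtained had it taken action a, including actions it did not take at that step. The two-armed bandit below is a made-up minimal example.

```python
import random

# Minimal tabular sketch (hypothetical bandit): although the agent takes
# only one action per step, the learned Q-values estimate the return of
# every action, i.e. what would have happened under the alternatives.
random.seed(0)
true_mean = {0: 1.0, 1: 0.3}   # action 0 is better on average
q = {0: 0.0, 1: 0.0}
alpha = 0.1

for _ in range(2000):
    a = random.choice([0, 1])             # uniform exploration
    r = random.gauss(true_mean[a], 0.1)   # noisy reward
    q[a] += alpha * (r - q[a])            # tabular Q-learning update

# q now answers the counterfactual query "what if I had pulled arm a?"
print({a: round(v, 2) for a, v in q.items()})  # roughly {0: 1.0, 1: 0.3}
```
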
Identification of fundamental trade-off between counterfactuality and robustness

The authors formalize and demonstrate, through theory and experiments, a fundamental trade-off showing that certified ε-invariance training sacrifices the inherent counterfactual reasoning capabilities of standard RL in pursuit of robustness guarantees, and they reveal the core mechanisms behind this phenomenon.

10 retrieved papers
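
As a back-of-the-envelope argument (our own hedged illustration, not the paper's theorem), one can see why some trade-off of this form is unavoidable when invariance is enforced exactly:

```latex
\paragraph{Hedged illustration (not the paper's theorem).}
Assume $\hat{Q}$ is exactly $\epsilon$-invariant:
$\hat{Q}(s_1,a) = \hat{Q}(s_2,a)$ whenever $\|s_1 - s_2\| \le \epsilon$.
If the true optimal values vary within one $\epsilon$-ball,
\[
  \delta \;=\; Q^*(s_1,a) - Q^*(s_2,a) \;>\; 0,
  \qquad \|s_1 - s_2\| \le \epsilon,
\]
then, since $\hat{Q}(s_1,a) = \hat{Q}(s_2,a)$, the triangle inequality gives
\[
  \bigl|\hat{Q}(s_1,a) - Q^*(s_1,a)\bigr|
  + \bigl|\hat{Q}(s_2,a) - Q^*(s_2,a)\bigr| \;\ge\; \delta ,
\]
so the worst-case estimation error is bounded below:
$\sup_{s,a} |\hat{Q}(s,a) - Q^*(s,a)| \ge \delta/2$.
Exact invariance and exact value accuracy therefore cannot coexist
whenever $Q^*$ genuinely varies at scale $\epsilon$.
```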

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Within the taxonomy built over the current top-K core-task papers, the original paper is assigned to a leaf with no direct siblings and no cousin branches under the same grandparent topic. In this retrieved landscape it appears structurally isolated, which is a partial signal of novelty, though one still constrained by search coverage and taxonomy granularity.

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution 1: Theoretical analysis of ε-local invariance training effects on Q-functions

The authors present a formal theoretical framework demonstrating that ε-local invariance training fundamentally alters learned value judgments in reinforcement learning. They prove an inherent trade-off between accurate Q-value estimation and robustness, showing that ε-invariant Q-functions overestimate optimal values and misalign counterfactual action rankings.

No candidate papers were retrieved for this contribution, so no direct comparisons were performed.

Contribution 2: Discovery that standard RL possesses inherent counterfactual reasoning ability

The authors establish that standard reinforcement learning naturally learns counterfactual values aligned with human decision-making processes, while ε-invariance training methods cause policies to lose this inherent counterfactual ability, resulting in inaccurate, inconsistent, and misaligned value functions.

Ten candidate papers were retrieved; none were judged to refute this contribution.

Contribution 3: Identification of fundamental trade-off between counterfactuality and robustness

The authors formalize and demonstrate, through theory and experiments, a fundamental trade-off showing that certified ε-invariance training sacrifices the inherent counterfactual reasoning capabilities of standard RL in pursuit of robustness guarantees, and they reveal the core mechanisms behind this phenomenon.

Ten candidate papers were retrieved; none were judged to refute this contribution.