Causally-Enhanced Reinforcement Policy Optimization of Large Language Models

ICLR 2026 Conference Submission · Anonymous Authors
Keywords: Large Language Models, Reinforcement Learning, Causal Inference
Abstract:

Large language models (LLMs) trained with reinforcement objectives often achieve superficially correct answers via shortcut strategies, pairing correct outputs with spurious or unfaithful reasoning and degrading under small causal perturbations. We introduce Causally-Enhanced Policy Optimization (CE-PO), a drop-in reward-shaping framework that augments policy optimization with a differentiable proxy for causal coherence along the generation pathway from prompt (Z) to rationale (X) to answer (Y). CE-PO estimates model-internal influence with Jacobian-based sensitivities, counterfactually hardens these signals to suppress nuisance cues, and fuses the resulting coherence score with task-accuracy feedback via a Minkowski (power-mean) combiner, exposing a single tunable parameter that governs the accuracy-coherence trade-off. The unified reward integrates with PPO/GRPO without architectural changes. Across reasoning benchmarks and causal stress tests, CE-PO reduces reward hacking and unfaithful chain-of-thought while improving robustness to correlation-causation flips and light counterfactual edits, all at near-parity accuracy. Experimental results across 4 datasets show that CE-PO improves accuracy over baselines by 5.49% on average (up to 9.58%).

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces CE-PO, a reward-shaping framework that integrates causal coherence signals into policy optimization for LLM reasoning. It resides in the 'Causal and Faithful Reasoning' leaf of the taxonomy, which currently contains no sibling papers among the 50 surveyed works. This isolation suggests the leaf represents an emerging or under-explored research direction within the broader field of LLM reasoning coherence. The taxonomy positions this work as distinct from reward engineering approaches that combine multiple objectives without explicit causal grounding, and from consistency methods that rely on self-supervised signals rather than causal pathway analysis.

The taxonomy reveals several neighboring research directions. 'Reward Engineering and Shaping' includes multi-objective reward frameworks and self-supervised consistency methods, while 'Core RL Training Methodologies' encompasses policy optimization algorithms and training dynamics studies. The 'Evaluation, Analysis, and Theoretical Foundations' branch contains work on logical consistency and coherence measurement. CE-PO bridges these areas by embedding causal analysis into reward shaping, distinguishing itself from general multi-objective approaches through its focus on prompt-rationale-answer causal pathways and from consistency methods through its use of Jacobian-based influence estimation rather than trajectory agreement.

Among 21 candidates examined across three contributions, no clearly refuting prior work was identified. The CE-PO framework itself was compared against 10 candidates with no refutations found. The counterfactual hardening procedure and Minkowski combiner were examined against 10 and 1 candidates respectively, likewise yielding no overlapping prior work. This limited search scope, focused on top-K semantic matches and citation expansion, suggests the specific combination of Jacobian-based causal estimation, counterfactual hardening, and power-mean reward fusion may be novel within the examined literature, though the analysis does not constitute an exhaustive field survey.

Based on the available signals, CE-PO appears to occupy a sparsely populated research direction within the taxonomy. The absence of sibling papers and lack of refuting candidates among 21 examined works suggest potential novelty, though the limited search scope prevents definitive claims about the broader literature. The framework's integration of causal coherence into policy optimization represents a distinct approach compared to neighboring reward shaping and consistency methods identified in the taxonomy.

Taxonomy

- Core-task Taxonomy Papers: 50
- Claimed Contributions: 3
- Contribution Candidate Papers Compared: 21
- Refutable Papers: 0

Research Landscape Overview

Core task: Improving reasoning coherence in large language model reinforcement learning. The field has organized itself around several complementary branches that together address how to train, structure, and evaluate LLMs for more coherent reasoning. Core RL Training Methodologies for LLM Reasoning encompasses foundational policy optimization techniques and reward-shaping strategies, including works like DeepSeek R1[3] and Search R1[2] that explore how RL can elicit step-by-step reasoning. Reasoning Mechanisms and Architectural Enhancements focuses on structural innovations, such as search-based methods (e.g., Learning Reason Search[1]) and hierarchical or planning-oriented designs, that guide the model's internal reasoning process. Output Quality Control and Efficiency tackles practical concerns like conciseness (Concise Reasoning[5]) and computational cost, while Causal and Faithful Reasoning emphasizes logical consistency, verifiability, and grounding in sound inference. Reasoning Capability Expansion and Application Domains extends these methods to specialized tasks (mathematics, legal reasoning, software engineering), and Evaluation, Analysis, and Theoretical Foundations provides the metrics and conceptual frameworks needed to assess progress.

Within this landscape, a particularly active line of work centers on ensuring that reasoning traces are not only correct but also logically coherent and causally grounded. Causal Policy Optimization[0] sits squarely in the Causal and Faithful Reasoning branch, emphasizing the importance of causal structure in policy learning to avoid spurious correlations and maintain step-by-step validity. This contrasts with approaches in Core RL Training that prioritize outcome-based rewards (e.g., DeepSeek R1 Incentivizes[4]) or efficiency-focused methods like Concise Reasoning[5] that may compress reasoning at the expense of transparency. Nearby works such as Logical Reasoning Survey[7] and Logic RL[8] also stress formal consistency, while others like Consistent Paths Truth[34] and Logical Consistency Factchecking[46] explore verification mechanisms. The central tension across these branches is balancing the flexibility of RL-driven exploration with the need for interpretable, faithful reasoning, a challenge that Causal Policy Optimization[0] addresses by embedding causal principles directly into the optimization objective.

Claimed Contributions

Causally-Enhanced Policy Optimization (CE-PO) framework

The authors propose CE-PO, a reward-shaping framework that integrates causal coherence signals into standard policy optimization (PPO/GRPO) without architectural changes. It uses Jacobian-based sensitivities to measure influence along the Z→X→Y reasoning chain and combines these with task accuracy rewards.

(10 retrieved papers compared)
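The Jacobian-based sensitivity signal can be illustrated with a finite-difference sketch. This is a toy under stated assumptions, not the authors' implementation: the helper `jacobian_sensitivity`, the linear `score` function, and its weights `w` are all hypothetical stand-ins for an answer score's dependence on a rationale representation.

```python
import numpy as np

def jacobian_sensitivity(f, x, eps=1e-5):
    """Central finite-difference estimate of ||df/dx||, a scalar proxy
    for how strongly the input x influences the score f(x)."""
    x = np.asarray(x, dtype=float)
    grad = np.zeros_like(x)
    for i in range(x.size):
        xp = x.copy(); xp.flat[i] += eps
        xm = x.copy(); xm.flat[i] -= eps
        grad.flat[i] = (f(xp) - f(xm)) / (2 * eps)
    return float(np.linalg.norm(grad))

# Toy "answer score" that depends linearly on a rationale embedding x.
w = np.array([0.5, -1.0, 2.0])
score = lambda x: float(w @ x)

s = jacobian_sensitivity(score, np.zeros(3))
# For a linear score the sensitivity equals ||w||.
```

In practice one would use autodiff rather than finite differences; the sketch only shows the quantity being measured: the gradient norm of the answer score with respect to an upstream representation.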
Counterfactual hardening procedure for Jacobian signals

The authors introduce a counterfactual hardening technique that removes nuisance directions from raw Jacobian signals by constructing source-side counterfactuals through token reshuffling and projecting out spurious sensitivity subspaces, thereby reducing reward hacking and shortcut learning.

(10 retrieved papers compared)
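One plausible reading of this hardening step can be sketched as follows, assuming that sensitivities measured under counterfactual (token-reshuffled) inputs retain only nuisance directions. The `harden` function, the toy gradients, and the rank-1 nuisance subspace are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np

rng = np.random.default_rng(0)

def harden(g, counterfactual_grads, k=1):
    """Project a raw sensitivity vector g away from the top-k directions
    spanned by gradients obtained under counterfactual inputs, removing
    spurious (nuisance) sensitivity while keeping the causal part."""
    G = np.stack(counterfactual_grads)          # (n_cf, d)
    _, _, Vt = np.linalg.svd(G, full_matrices=False)
    N = Vt[:k]                                  # nuisance directions, (k, d)
    return g - N.T @ (N @ g)                    # orthogonal projection

# Suppose the causal signal lies along axis 0 and a nuisance cue along axis 1.
g = np.array([1.0, 1.0, 0.0])                   # raw sensitivity (mixed)
cf = [np.array([0.0, 1.0, 0.0]) + 0.01 * rng.standard_normal(3)
      for _ in range(5)]                        # counterfactuals keep the nuisance cue
g_hard = harden(g, cf, k=1)
# g_hard is approximately [1, 0, 0]: the nuisance component is suppressed.
```

The design choice here is that token reshuffling destroys the causal prompt-to-rationale pathway but leaves surface-level cues intact, so directions that survive reshuffling can be treated as a subspace to project out.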
Minkowski power-mean combiner for reward fusion

The authors develop a Minkowski power-mean combiner that fuses task accuracy rewards with causal coherence scores, providing a single tunable parameter to balance the accuracy-coherence trade-off during optimization while maintaining compatibility with standard policy gradient methods.

(1 retrieved paper compared)
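A power-mean (Minkowski) combiner of this kind can be written in a few lines. This is a minimal sketch assuming equal base weights and rewards in [0, 1]; the function name and parameterization are illustrative, not the paper's exact formulation.

```python
import numpy as np

def minkowski_combine(r_acc, r_coh, p):
    """Power-mean fusion of an accuracy reward and a coherence reward.
    p = 1 recovers the arithmetic mean; p -> +inf approaches max;
    p -> -inf approaches min (both signals must be high to score well).
    Assumes strictly positive rewards when p <= 0."""
    if p == 0:  # limit case: geometric mean
        return float(np.sqrt(r_acc * r_coh))
    return float((0.5 * (r_acc ** p + r_coh ** p)) ** (1.0 / p))

# With p < 1 the combiner penalizes imbalance: a correct answer paired
# with incoherent reasoning earns less than a balanced reward profile,
# even when the two profiles have the same total.
r_imbalanced = minkowski_combine(1.0, 0.0, p=0.5)   # 0.25
r_balanced = minkowski_combine(0.5, 0.5, p=0.5)     # 0.5
```

The single parameter `p` is the tunable knob: lowering it forces the policy to earn both accuracy and coherence rather than trading one for the other, which is what discourages reward hacking on accuracy alone.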

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Within the taxonomy built over the current top-K core-task papers, the original paper is assigned to a leaf with no direct siblings and no cousin branches under the same grandparent topic. In this retrieved landscape it appears structurally isolated, which is a partial signal of novelty, though one constrained by search coverage and taxonomy granularity.

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Causally-Enhanced Policy Optimization (CE-PO) framework

Compared against 10 retrieved candidate papers (summarized under Claimed Contributions above); no refuting prior work was identified.

Contribution

Counterfactual hardening procedure for Jacobian signals

Compared against 10 retrieved candidate papers; no refuting prior work was identified.

Contribution

Minkowski power-mean combiner for reward fusion

Compared against 1 retrieved candidate paper; no overlapping prior work was identified.