Causally-Enhanced Reinforcement Policy Optimization of Large Language Models

ICLR 2026 Conference Submission · Anonymous Authors
Keywords: Large Language Models, Reinforcement Learning, Causal Inference
Abstract:

Large language models (LLMs) trained with reinforcement objectives often achieve superficially correct answers via shortcut strategies, pairing correct outputs with spurious or unfaithful reasoning and degrading under small causal perturbations. We introduce Causally-Enhanced Policy Optimization (CE-PO), a drop-in reward-shaping framework that augments policy optimization with a differentiable proxy for causal coherence along the generation pathway from prompt (Z) to rationale (X) to answer (Y). CE-PO estimates model-internal influence with Jacobian-based sensitivities, counterfactually hardens these signals to suppress nuisance cues, and fuses the resulting coherence score with task-accuracy feedback via a Minkowski (power-mean) combiner, exposing a single tunable parameter that governs the accuracy-coherence trade-off. The unified reward integrates with PPO/GRPO without architectural changes. Across reasoning benchmarks and causal stress tests, CE-PO reduces reward hacking and unfaithful chain-of-thought while improving robustness to correlation-causation flips and light counterfactual edits, all at near-parity accuracy. Experimental results across 4 datasets show that CE-PO improves accuracy over baselines by 5.49% on average (up to 9.58%).

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces CE-PO, a reward-shaping framework that integrates causal coherence signals into policy optimization for LLM reasoning. It resides in the 'Causal and Faithful Reasoning' leaf of the taxonomy, which currently contains no sibling papers among the 50 surveyed works. This isolation suggests the leaf represents an emerging or under-explored research direction within the broader field of LLM reasoning coherence. The taxonomy positions this work as distinct from reward engineering approaches that combine multiple objectives without explicit causal grounding, and from consistency methods that rely on self-supervised signals rather than causal pathway analysis.

The taxonomy reveals several neighboring research directions. 'Reward Engineering and Shaping' includes multi-objective reward frameworks and self-supervised consistency methods, while 'Core RL Training Methodologies' encompasses policy optimization algorithms and training dynamics studies. The 'Evaluation, Analysis, and Theoretical Foundations' branch contains work on logical consistency and coherence measurement. CE-PO bridges these areas by embedding causal analysis into reward shaping, distinguishing itself from general multi-objective approaches through its focus on prompt-rationale-answer causal pathways and from consistency methods through its use of Jacobian-based influence estimation rather than trajectory agreement.

Among 21 candidates examined across three contributions, no clearly refuting prior work was identified. The CE-PO framework itself was compared against 10 candidates with no refutations found. The counterfactual hardening procedure and Minkowski combiner were examined against 10 and 1 candidates respectively, likewise yielding no overlapping prior work. This limited search scope, focused on top-K semantic matches and citation expansion, suggests the specific combination of Jacobian-based causal estimation, counterfactual hardening, and power-mean reward fusion may be novel within the examined literature, though the analysis does not constitute an exhaustive field survey.

Based on the available signals, CE-PO appears to occupy a sparsely populated research direction within the taxonomy. The absence of sibling papers and lack of refuting candidates among 21 examined works suggest potential novelty, though the limited search scope prevents definitive claims about the broader literature. The framework's integration of causal coherence into policy optimization represents a distinct approach compared to neighboring reward shaping and consistency methods identified in the taxonomy.

Taxonomy

- Core-task Taxonomy Papers: 50
- Claimed Contributions: 3
- Contribution Candidate Papers Compared: 21
- Refutable Papers: 0

Research Landscape Overview

Core task: Improving reasoning coherence in large language model reinforcement learning. The field has organized itself around several complementary branches that together address how to train, structure, and evaluate LLMs for more coherent reasoning. Core RL Training Methodologies for LLM Reasoning encompasses foundational policy optimization techniques and reward-shaping strategies, including works like DeepSeek R1[3] and Search R1[2] that explore how RL can elicit step-by-step reasoning. Reasoning Mechanisms and Architectural Enhancements focuses on structural innovations, such as search-based methods (e.g., Learning Reason Search[1]) and hierarchical or planning-oriented designs, that guide the model's internal reasoning process. Output Quality Control and Efficiency tackles practical concerns like conciseness (Concise Reasoning[5]) and computational cost, while Causal and Faithful Reasoning emphasizes logical consistency, verifiability, and grounding in sound inference. Reasoning Capability Expansion and Application Domains extends these methods to specialized tasks (mathematics, legal reasoning, software engineering), and Evaluation, Analysis, and Theoretical Foundations provides the metrics and conceptual frameworks needed to assess progress.

Within this landscape, a particularly active line of work centers on ensuring that reasoning traces are not only correct but also logically coherent and causally grounded. Causal Policy Optimization[0] sits squarely in the Causal and Faithful Reasoning branch, emphasizing the importance of causal structure in policy learning to avoid spurious correlations and maintain step-by-step validity. This contrasts with approaches in Core RL Training that prioritize outcome-based rewards (e.g., DeepSeek R1 Incentivizes[4]) or efficiency-focused methods like Concise Reasoning[5] that may compress reasoning at the expense of transparency. Nearby works such as Logical Reasoning Survey[7] and Logic RL[8] also stress formal consistency, while others like Consistent Paths Truth[34] and Logical Consistency Factchecking[46] explore verification mechanisms. The central tension across these branches is balancing the flexibility of RL-driven exploration with the need for interpretable, faithful reasoning, a challenge that Causal Policy Optimization[0] addresses by embedding causal principles directly into the optimization objective.

Claimed Contributions

Causally-Enhanced Policy Optimization (CE-PO) framework

The authors propose CE-PO, a reward-shaping framework that integrates causal coherence signals into standard policy optimization (PPO/GRPO) without architectural changes. It uses Jacobian-based sensitivities to measure influence along the Z→X→Y reasoning chain and combines these with task accuracy rewards.

(10 retrieved papers compared)
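The Jacobian-based sensitivity signal can be illustrated with a finite-difference sketch. This is a toy under stated assumptions, not the authors' implementation: the helper `jacobian_sensitivity`, the linear `score` function, and its weights `w` are all hypothetical stand-ins for an answer score's dependence on a rationale representation.

```python
import numpy as np

def jacobian_sensitivity(f, x, eps=1e-5):
    """Central finite-difference estimate of ||df/dx||, a scalar proxy
    for how strongly the input x influences the score f(x)."""
    x = np.asarray(x, dtype=float)
    grad = np.zeros_like(x)
    for i in range(x.size):
        xp = x.copy(); xp.flat[i] += eps
        xm = x.copy(); xm.flat[i] -= eps
        grad.flat[i] = (f(xp) - f(xm)) / (2 * eps)
    return float(np.linalg.norm(grad))

# Toy "answer score" that depends linearly on a rationale embedding x.
w = np.array([0.5, -1.0, 2.0])
score = lambda x: float(w @ x)

s = jacobian_sensitivity(score, np.zeros(3))
# For a linear score the sensitivity equals ||w||.
```

In practice one would use autodiff rather than finite differences; the sketch only shows the quantity being measured: the gradient norm of the answer score with respect to an upstream representation.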
Counterfactual hardening procedure for Jacobian signals

The authors introduce a counterfactual hardening technique that removes nuisance directions from raw Jacobian signals by constructing source-side counterfactuals through token reshuffling and projecting out spurious sensitivity subspaces, thereby reducing reward hacking and shortcut learning.

(10 retrieved papers compared)
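One plausible reading of this hardening step can be sketched as follows, assuming that sensitivities measured under counterfactual (token-reshuffled) inputs retain only nuisance directions. The `harden` function, the toy gradients, and the rank-1 nuisance subspace are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np

rng = np.random.default_rng(0)

def harden(g, counterfactual_grads, k=1):
    """Project a raw sensitivity vector g away from the top-k directions
    spanned by gradients obtained under counterfactual inputs, removing
    spurious (nuisance) sensitivity while keeping the causal part."""
    G = np.stack(counterfactual_grads)          # (n_cf, d)
    _, _, Vt = np.linalg.svd(G, full_matrices=False)
    N = Vt[:k]                                  # nuisance directions, (k, d)
    return g - N.T @ (N @ g)                    # orthogonal projection

# Suppose the causal signal lies along axis 0 and a nuisance cue along axis 1.
g = np.array([1.0, 1.0, 0.0])                   # raw sensitivity (mixed)
cf = [np.array([0.0, 1.0, 0.0]) + 0.01 * rng.standard_normal(3)
      for _ in range(5)]                        # counterfactuals keep the nuisance cue
g_hard = harden(g, cf, k=1)
# g_hard is approximately [1, 0, 0]: the nuisance component is suppressed.
```

The design choice here is that token reshuffling destroys the causal prompt-to-rationale pathway but leaves surface-level cues intact, so directions that survive reshuffling can be treated as a subspace to project out.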
Minkowski power-mean combiner for reward fusion

The authors develop a Minkowski power-mean combiner that fuses task accuracy rewards with causal coherence scores, providing a single tunable parameter to balance the accuracy-coherence trade-off during optimization while maintaining compatibility with standard policy gradient methods.

(1 retrieved paper compared)
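A power-mean (Minkowski) combiner of this kind can be written in a few lines. This is a minimal sketch assuming equal base weights and rewards in [0, 1]; the function name and parameterization are illustrative, not the paper's exact formulation.

```python
import numpy as np

def minkowski_combine(r_acc, r_coh, p):
    """Power-mean fusion of an accuracy reward and a coherence reward.
    p = 1 recovers the arithmetic mean; p -> +inf approaches max;
    p -> -inf approaches min (both signals must be high to score well).
    Assumes strictly positive rewards when p <= 0."""
    if p == 0:  # limit case: geometric mean
        return float(np.sqrt(r_acc * r_coh))
    return float((0.5 * (r_acc ** p + r_coh ** p)) ** (1.0 / p))

# With p < 1 the combiner penalizes imbalance: a correct answer paired
# with incoherent reasoning earns less than a balanced reward profile,
# even when the two profiles have the same total.
r_imbalanced = minkowski_combine(1.0, 0.0, p=0.5)   # 0.25
r_balanced = minkowski_combine(0.5, 0.5, p=0.5)     # 0.5
```

The single parameter `p` is the tunable knob: lowering it forces the policy to earn both accuracy and coherence rather than trading one for the other, which is what discourages reward hacking on accuracy alone.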

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Within the taxonomy built over the current top-K core-task papers, the original paper is assigned to a leaf with no direct siblings and no cousin branches under the same grandparent topic. In this retrieved landscape it appears structurally isolated, which is a partial signal of novelty, though one constrained by search coverage and taxonomy granularity.

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Causally-Enhanced Policy Optimization (CE-PO) framework

Compared against 10 retrieved candidate papers (summarized under Claimed Contributions above); no refuting prior work was identified.

Contribution

Counterfactual hardening procedure for Jacobian signals

Compared against 10 retrieved candidate papers; no refuting prior work was identified.

Contribution

Minkowski power-mean combiner for reward fusion

Compared against 1 retrieved candidate paper; no overlapping prior work was identified.