LLMs Must Think Thrice to Solve Executable Counterfactuals

ICLR 2026 Conference Submission, Anonymous Authors
Keywords: Counterfactual Reasoning, Large Language Models, Reinforcement Learning, Generalization
Abstract:

Counterfactual reasoning, a hallmark of intelligence, consists of three steps: inferring latent variables from observations (abduction), constructing alternative situations (intervention), and predicting the outcomes of those alternatives (prediction). This skill is essential for advancing LLMs' causal understanding and expanding their applications in high-stakes domains such as scientific research and healthcare. However, existing efforts to assess LLMs' counterfactual reasoning tend to skip the abduction step, effectively reducing the task to interventional reasoning and leading to overestimated performance. To address this, we introduce executable counterfactuals, a novel framework that operationalizes causal reasoning through code and math problems. Our framework explicitly requires all three steps of counterfactual reasoning and enables scalable synthetic data creation with varying difficulty, creating a new frontier for evaluating and improving LLMs' reasoning. Our results reveal a substantial drop in accuracy (25-40%) from interventional to counterfactual reasoning for state-of-the-art models such as o4-mini and Claude-4-Sonnet. To address this gap, we construct a training set of counterfactual code problems built around if-conditions and test on out-of-distribution code structures (e.g., while-loops); we also test whether a model trained on code generalizes to counterfactual math word problems. While supervised finetuning (SFT) on stronger models' reasoning traces improves the in-distribution performance of Qwen models, it decreases accuracy on out-of-distribution tasks such as counterfactual math problems. In contrast, reinforcement learning (RL) induces the core cognitive behaviors and generalizes to new distributions, yielding substantial accuracy gains over the base model on both code (a 1.5x-2x improvement) and counterfactual math problems. Analysis of the reasoning traces further reinforces these findings and highlights the promise of RL with scalable data generation for improving LLMs' counterfactual reasoning.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces executable counterfactuals, a framework operationalizing counterfactual reasoning through code and math problems that explicitly require abduction, intervention, and prediction steps. It sits within the Formal Counterfactual Reasoning Benchmarks leaf, which contains four papers including siblings like CRASS and MalAlgoQA. This represents a moderately populated research direction within the broader Benchmark Development and Evaluation Frameworks branch, suggesting active but not overcrowded exploration of structured counterfactual evaluation methods.

The taxonomy reveals neighboring work in Causal Reasoning and Ladder-of-Causation Benchmarks, which assess reasoning across Pearl's causal hierarchy, and Natural Language Counterfactual Benchmarks, focused on story rewriting and hypothetical scenarios. The paper's emphasis on executable frameworks and formal structure distinguishes it from natural language approaches while complementing causal hierarchy assessments. The scope note for its leaf explicitly excludes commonsense or natural language counterfactuals, positioning this work at the intersection of formal reasoning and computational executability rather than linguistic understanding.

Across the thirty candidates examined (ten per contribution), the Executable Counterfactuals Framework shows one refutable candidate, the Template-Based Code Generation Method shows none, and the demonstration of RL superiority shows two. Given the limited search scope, these statistics reflect overlap among top semantic matches rather than exhaustive coverage of prior work. Within this sample, the code generation method appears most novel, while the framework and the training comparison face more substantial overlap with the examined literature.

Based on the limited thirty-candidate search, the work appears to occupy a recognizable but not densely populated niche within formal counterfactual evaluation. The executable framework concept and template-based generation show varying degrees of prior overlap, with the generation method appearing less anticipated in the examined literature. The analysis covers top-semantic matches and does not claim comprehensive field coverage, leaving open questions about broader landscape positioning.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 3

Research Landscape Overview

Core task: counterfactual reasoning in large language models. The field has organized itself around several complementary directions. Benchmark Development and Evaluation Frameworks focuses on creating rigorous testbeds for assessing counterfactual capabilities, ranging from formal reasoning benchmarks like Cladder[2] and CRASS[21] to domain-specific evaluations such as MalAlgoQA[23]. Capabilities and Limitations Analysis investigates what models can and cannot do, examining issues like memorization versus genuine reasoning (Counterfactual Memorization[3], Reasoning or Reciting[13]) and struggles with parametric knowledge conflicts. Training and Improvement Methodologies explores techniques to enhance counterfactual abilities through specialized training regimes or architectural modifications. Domain-Specific Applications adapts counterfactual reasoning to areas like finance (FinCARE[46]), autonomous driving (OmniDrive[49]), and vision-language tasks (CausalVLBench[26]). Explainability and Interpretability Applications leverages counterfactuals for model understanding and faithful explanations. Related Reasoning and Modeling Studies connects to broader causal inference frameworks and world-modeling efforts.

Recent work reveals a tension between formal causal reasoning and practical language understanding: many studies probe whether models truly grasp causal structure or merely exploit surface patterns, with benchmarks like Counterbench[1] and Unveiling Causal Reasoning[4] exposing systematic failures. Think Thrice[0] sits within the Formal Counterfactual Reasoning Benchmarks cluster, emphasizing structured evaluation of counterfactual inference capabilities. Compared to neighbors like CRASS[21], which targets commonsense abductive reasoning, and MalAlgoQA[23], which focuses on algorithmic counterfactuals, Think Thrice[0] appears to prioritize systematic assessment of reasoning depth. This positioning reflects ongoing debates about whether current models perform genuine counterfactual inference or rely on shallow heuristics, a question that cuts across evaluation frameworks and capability analyses throughout the taxonomy.

Claimed Contributions

Executable Counterfactuals Framework

The authors propose a framework that uses code understanding and math problems to evaluate counterfactual reasoning by explicitly requiring all three steps: abduction (inferring latent variables), intervention (constructing alternatives), and prediction (computing outcomes). This framework enables scalable synthetic data creation with controllable difficulty and provides verifiable ground-truth outcomes.

10 retrieved papers
Can Refute
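The three required steps can be illustrated with a toy executable counterfactual. The function, threshold values, and search range below are illustrative assumptions, not items from the paper's actual benchmark:

```python
def f(x):
    # Factual program: an if-condition couples the output to a latent input.
    if x > 5:
        return x * 2
    return x + 1

observed_output = 14  # We observe f(x) == 14, but the input x is latent.

# Step 1 (abduction): infer the latent input(s) consistent with the observation.
candidates = [x for x in range(100) if f(x) == observed_output]
x_star = candidates[0]  # only x == 7 reproduces the observation

# Step 2 (intervention): construct the alternative program, here by
# editing the branch threshold from 5 to 10.
def f_cf(x):
    if x > 10:
        return x * 2
    return x + 1

# Step 3 (prediction): run the intervened program on the abduced input.
counterfactual_output = f_cf(x_star)
print(counterfactual_output)  # 7 falls into the else-branch: 7 + 1 = 8
```

Skipping abduction (step 1) and running the intervened program on an arbitrary input reduces the task to interventional reasoning, which is the gap the framework is designed to expose.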
Template-Based Code Generation Method

The authors develop a template-based approach with three levels of placeholders (fixed, structural, and value) that generates diverse executable functions for training and evaluation. This method enables controlled complexity variation and out-of-distribution testing across different code structures and causal configurations.

10 retrieved papers
Demonstration of RL Superiority Over SFT for Counterfactual Reasoning

The authors show through experiments that reinforcement learning with verifiable rewards successfully elicits generalizable counterfactual reasoning skills that transfer across different code structures and from code to natural language math problems, while supervised finetuning fails to generalize beyond in-distribution tasks despite using stronger models' reasoning traces.

10 retrieved papers
Can Refute
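The RL signal in this setting is verifiable because the ground-truth counterfactual outcome can be computed by executing the intervened program on the abduced input. A minimal sketch of such a binary reward, where the function names and the integer-parsing convention are illustrative assumptions, not the paper's implementation:

```python
def verifiable_reward(model_answer: str, cf_program, abduced_input) -> float:
    """Binary reward: 1.0 iff the model's predicted counterfactual output
    matches the result of actually executing the intervened program."""
    ground_truth = cf_program(abduced_input)
    try:
        predicted = int(model_answer.strip())
    except ValueError:
        return 0.0  # unparseable answers earn no reward
    return 1.0 if predicted == ground_truth else 0.0

# Example: an intervened program and an abduced input from a toy setting.
def f_cf(x):
    if x > 10:
        return x * 2
    return x + 1

print(verifiable_reward("8", f_cf, 7))   # correct prediction: reward 1.0
print(verifiable_reward("14", f_cf, 7))  # wrong prediction: reward 0.0
```

Because the reward checks only the final executed outcome, it leaves the model free to discover its own abduction-intervention-prediction strategy, which is consistent with the paper's finding that RL elicits more generalizable behavior than imitating fixed SFT traces.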

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Executable Counterfactuals Framework


Contribution

Template-Based Code Generation Method


Contribution

Demonstration of RL Superiority Over SFT for Counterfactual Reasoning


LLMs Must Think Thrice to Solve Executable Counterfactuals | Novelty Validation