LLMs Must Think Thrice to Solve Executable Counterfactuals

ICLR 2026 Conference Submission, Anonymous Authors
Keywords: Counterfactual Reasoning, Large Language Models, Reinforcement Learning, Generalization
Abstract:

Counterfactual reasoning, a hallmark of intelligence, consists of three steps: inferring latent variables from observations (abduction), constructing alternative situations (intervention), and predicting the outcomes of those alternatives (prediction). This skill is essential for advancing LLMs' causal understanding and expanding their applications in high-stakes domains such as scientific research and healthcare. However, existing efforts to assess LLMs' counterfactual reasoning tend to skip the abduction step, effectively reducing the task to interventional reasoning and leading to overestimated performance. To address this, we introduce executable counterfactuals, a novel framework that operationalizes causal reasoning through code and math problems. Our framework explicitly requires all three steps of counterfactual reasoning and enables scalable synthetic data creation with varying difficulty, creating a new frontier for evaluating and improving LLMs' reasoning. Our results reveal a substantial drop in accuracy (25-40%) from interventional to counterfactual reasoning for state-of-the-art models such as o4-mini and Claude-4-Sonnet. To address this gap, we construct a training set of counterfactual code problems built around if-conditions and test on out-of-distribution code structures (e.g., while-loops); we also test whether a model trained on code generalizes to counterfactual math word problems. While supervised finetuning (SFT) on stronger models' reasoning traces improves the in-distribution performance of Qwen models, it decreases accuracy on out-of-distribution tasks such as counterfactual math problems. In contrast, reinforcement learning (RL) induces the core cognitive behaviors and generalizes to new distributions, yielding substantial accuracy gains over the base model on both code (a 1.5x-2x improvement) and counterfactual math problems. Analysis of the reasoning traces further reinforces these findings and highlights the promise of RL with scalable data generation for improving LLMs' counterfactual reasoning.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces executable counterfactuals, a framework operationalizing counterfactual reasoning through code and math problems that explicitly require abduction, intervention, and prediction steps. It sits within the Formal Counterfactual Reasoning Benchmarks leaf, which contains four papers including siblings like CRASS and MalAlgoQA. This represents a moderately populated research direction within the broader Benchmark Development and Evaluation Frameworks branch, suggesting active but not overcrowded exploration of structured counterfactual evaluation methods.

The taxonomy reveals neighboring work in Causal Reasoning and Ladder-of-Causation Benchmarks, which assess reasoning across Pearl's causal hierarchy, and Natural Language Counterfactual Benchmarks, focused on story rewriting and hypothetical scenarios. The paper's emphasis on executable frameworks and formal structure distinguishes it from natural language approaches while complementing causal hierarchy assessments. The scope note for its leaf explicitly excludes commonsense or natural language counterfactuals, positioning this work at the intersection of formal reasoning and computational executability rather than linguistic understanding.

Across the thirty candidates examined (ten per contribution), the Executable Counterfactuals Framework shows one refutable candidate, the Template-Based Code Generation Method shows none, and the demonstration of RL superiority shows two. Given the limited search scope, these statistics reflect overlap among top semantic matches rather than exhaustive coverage of prior work. Within this sample, the code generation method appears most novel, while the framework and the training comparison face more substantial overlap with the examined literature.

Based on the limited thirty-candidate search, the work appears to occupy a recognizable but not densely populated niche within formal counterfactual evaluation. The executable framework concept and template-based generation show varying degrees of prior overlap, with the generation method appearing less anticipated in the examined literature. The analysis covers top-semantic matches and does not claim comprehensive field coverage, leaving open questions about broader landscape positioning.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 3

Research Landscape Overview

Core task: counterfactual reasoning in large language models. The field has organized itself around several complementary directions. Benchmark Development and Evaluation Frameworks focuses on creating rigorous testbeds for assessing counterfactual capabilities, ranging from formal reasoning benchmarks like Cladder[2] and CRASS[21] to domain-specific evaluations such as MalAlgoQA[23]. Capabilities and Limitations Analysis investigates what models can and cannot do, examining issues like memorization versus genuine reasoning (Counterfactual Memorization[3], Reasoning or Reciting[13]) and struggles with parametric knowledge conflicts. Training and Improvement Methodologies explores techniques to enhance counterfactual abilities through specialized training regimes or architectural modifications. Domain-Specific Applications adapts counterfactual reasoning to areas like finance (FinCARE[46]), autonomous driving (OmniDrive[49]), and vision-language tasks (CausalVLBench[26]). Explainability and Interpretability Applications leverages counterfactuals for model understanding and faithful explanations. Related Reasoning and Modeling Studies connects to broader causal inference frameworks and world-modeling efforts.

Recent work reveals a tension between formal causal reasoning and practical language understanding: many studies probe whether models truly grasp causal structure or merely exploit surface patterns, with benchmarks like Counterbench[1] and Unveiling Causal Reasoning[4] exposing systematic failures. Think Thrice[0] sits within the Formal Counterfactual Reasoning Benchmarks cluster, emphasizing structured evaluation of counterfactual inference capabilities. Compared to neighbors like CRASS[21], which targets commonsense abductive reasoning, and MalAlgoQA[23], which focuses on algorithmic counterfactuals, Think Thrice[0] appears to prioritize systematic assessment of reasoning depth. This positioning reflects ongoing debates about whether current models perform genuine counterfactual inference or rely on shallow heuristics, a question that cuts across evaluation frameworks and capability analyses throughout the taxonomy.

Claimed Contributions

Executable Counterfactuals Framework

The authors propose a framework that uses code understanding and math problems to evaluate counterfactual reasoning by explicitly requiring all three steps: abduction (inferring latent variables), intervention (constructing alternatives), and prediction (computing outcomes). This framework enables scalable synthetic data creation with controllable difficulty and provides verifiable ground-truth outcomes.

10 retrieved papers
Can Refute
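The three required steps can be illustrated with a toy executable counterfactual. The function, threshold values, and search range below are illustrative assumptions, not items from the paper's actual benchmark:

```python
def f(x):
    # Factual program: an if-condition couples the output to a latent input.
    if x > 5:
        return x * 2
    return x + 1

observed_output = 14  # We observe f(x) == 14, but the input x is latent.

# Step 1 (abduction): infer the latent input(s) consistent with the observation.
candidates = [x for x in range(100) if f(x) == observed_output]
x_star = candidates[0]  # only x == 7 reproduces the observation

# Step 2 (intervention): construct the alternative program, here by
# editing the branch threshold from 5 to 10.
def f_cf(x):
    if x > 10:
        return x * 2
    return x + 1

# Step 3 (prediction): run the intervened program on the abduced input.
counterfactual_output = f_cf(x_star)
print(counterfactual_output)  # 7 falls into the else-branch: 7 + 1 = 8
```

Skipping abduction (step 1) and running the intervened program on an arbitrary input reduces the task to interventional reasoning, which is the gap the framework is designed to expose.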
Template-Based Code Generation Method

The authors develop a template-based approach with three levels of placeholders (fixed, structural, and value) that generates diverse executable functions for training and evaluation. This method enables controlled complexity variation and out-of-distribution testing across different code structures and causal configurations.

10 retrieved papers
Demonstration of RL Superiority Over SFT for Counterfactual Reasoning

The authors show through experiments that reinforcement learning with verifiable rewards successfully elicits generalizable counterfactual reasoning skills that transfer across different code structures and from code to natural language math problems, while supervised finetuning fails to generalize beyond in-distribution tasks despite using stronger models' reasoning traces.

10 retrieved papers
Can Refute
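The RL signal in this setting is verifiable because the ground-truth counterfactual outcome can be computed by executing the intervened program on the abduced input. A minimal sketch of such a binary reward, where the function names and the integer-parsing convention are illustrative assumptions, not the paper's implementation:

```python
def verifiable_reward(model_answer: str, cf_program, abduced_input) -> float:
    """Binary reward: 1.0 iff the model's predicted counterfactual output
    matches the result of actually executing the intervened program."""
    ground_truth = cf_program(abduced_input)
    try:
        predicted = int(model_answer.strip())
    except ValueError:
        return 0.0  # unparseable answers earn no reward
    return 1.0 if predicted == ground_truth else 0.0

# Example: an intervened program and an abduced input from a toy setting.
def f_cf(x):
    if x > 10:
        return x * 2
    return x + 1

print(verifiable_reward("8", f_cf, 7))   # correct prediction: reward 1.0
print(verifiable_reward("14", f_cf, 7))  # wrong prediction: reward 0.0
```

Because the reward checks only the final executed outcome, it leaves the model free to discover its own abduction-intervention-prediction strategy, which is consistent with the paper's finding that RL elicits more generalizable behavior than imitating fixed SFT traces.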

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Executable Counterfactuals Framework


Contribution

Template-Based Code Generation Method


Contribution

Demonstration of RL Superiority Over SFT for Counterfactual Reasoning


LLMs Must Think Thrice to Solve Executable Counterfactuals | Novelty Validation