DAG-Math: Graph-Guided Mathematical Reasoning in LLMs

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: LLMs, mathematical reasoning, directed acyclic graphs
Abstract:

Large Language Models (LLMs) demonstrate strong performance on mathematical problems when prompted with Chain-of-Thought (CoT), yet it remains unclear whether this success stems from search, rote procedures, or rule-consistent reasoning. To address this, we propose modeling CoT as a rule-based stochastic process over directed acyclic graphs (DAGs), where nodes represent intermediate derivation states and edges encode rule applications. Within this framework, we introduce logical closeness, a metric that quantifies how well a model's CoT trajectory (i.e., the LLM's output) adheres to the DAG structure, providing evaluation beyond the classical pass@k metric. Building on this, we introduce the DAG-MATH CoT format and construct a benchmark that guides LLMs to generate CoT trajectories in this format, enabling evaluation of their reasoning ability under our framework. Across standard mathematical reasoning datasets, our analysis uncovers statistically significant differences in reasoning fidelity among representative LLM families, even when pass@k is comparable, highlighting gaps between final-answer accuracy and rule-consistent derivation. Our framework strikes a balance between free-form CoT and formal proof systems, offering actionable diagnostics for LLM reasoning evaluation.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes modeling Chain-of-Thought reasoning as a rule-based stochastic process over directed acyclic graphs, introducing a 'logical closeness' metric to evaluate adherence to derivation rules beyond final-answer accuracy. It resides in the 'Rule-Based Reasoning Fidelity Metrics' leaf, which contains only two papers in total (this one and DAG-Think-Twice). This represents a relatively sparse research direction within the broader taxonomy of 28 papers, suggesting that the specific focus on rule-fidelity metrics for mathematical reasoning is an emerging rather than crowded area.

The taxonomy reveals neighboring evaluation approaches in sibling leaves: 'Dynamic Benchmark Generation via DAGs' focuses on test sample creation to avoid contamination, 'Step-Level Reasoning Verification' validates individual reasoning steps, and 'Semantic Structure Analysis' parses traces into DAGs for pattern characterization. The parent category 'DAG-Based Evaluation and Verification Methods' encompasses five distinct evaluation philosophies, while adjacent branches like 'DAG-Based Reasoning Frameworks' and 'DAG-Guided Data Synthesis' address generation rather than assessment. The scope_note clarifies this leaf specifically measures trajectory adherence to DAG-encoded rules, distinguishing it from semantic parsing or dynamic benchmarking approaches.

Among 30 candidates examined, the logical closeness metric (Contribution 2) shows one refutable candidate among the 10 examined, indicating that some prior work on fidelity measurement exists within this limited search scope. The DAG-based framework (Contribution 1) and DAG-MATH benchmark (Contribution 3) each examined 10 candidates with zero refutations, suggesting these contributions may occupy less-explored territory among the top-30 semantic matches. The statistics indicate modest overlap for the metric component, while the framework and benchmark appear more distinctive within this constrained candidate pool, though the search scope does not cover the entire field exhaustively.

Based on the limited top-30 semantic search, the work appears to introduce a relatively novel evaluation perspective within a sparse taxonomy leaf, though one contribution shows measurable prior overlap. The analysis covers semantic neighbors and citation-expanded candidates but does not constitute comprehensive field coverage, leaving open the possibility of additional related work beyond the examined scope.

Taxonomy

Core-task Taxonomy Papers: 28
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 1

Research Landscape Overview

Core task: Evaluating mathematical reasoning in large language models using directed acyclic graphs. The field has organized itself around several complementary branches that collectively address how DAG structures can enhance both the generation and assessment of mathematical reasoning. DAG-Based Reasoning Frameworks and Architectures develop novel computational structures that explicitly model reasoning steps as graph nodes, enabling more transparent multi-step inference. DAG-Based Evaluation and Verification Methods focus on measuring reasoning fidelity and correctness by leveraging dependency structures inherent in mathematical proofs. DAG-Guided Data Synthesis and Training explores how graph-based templates can generate high-quality training data that respects logical dependencies. Meanwhile, Causal Reasoning and DAG-Based Causal Inference apply causal graph theory to understand and improve model behavior, and Graph-Augmented Reasoning and Retrieval integrates external knowledge graphs to support complex problem-solving. Premise-Based and Dependency-Aware Reasoning emphasizes tracking logical prerequisites, while Theoretical Foundations of Length Generalization investigates the conditions under which models can extend reasoning to longer chains.

Several active lines of work reveal key trade-offs between structural expressiveness and computational tractability. DAG-Math[3] and related evaluation frameworks emphasize rigorous verification of reasoning paths, contrasting with data synthesis approaches that prioritize scalable generation of diverse problem instances. The original paper[0] situates itself within the Rule-Based Reasoning Fidelity Metrics cluster, closely aligned with DAG-Think-Twice[28], both focusing on measuring how faithfully models adhere to prescribed logical rules during multi-step reasoning. This contrasts with broader graph-augmented methods like Graph Chain-of-Thought[9] or ReasoningFlow[12], which integrate external knowledge but may sacrifice fine-grained fidelity assessment. A central open question across these branches is how to balance the interpretability gains from explicit DAG structures against the flexibility needed for open-ended mathematical discovery, with ongoing work exploring hybrid architectures that combine symbolic verification with neural adaptability.

Claimed Contributions

DAG-based framework for modeling CoT as a rule-based stochastic process

The authors formalize Chain-of-Thought reasoning in two phases: Phase 1 constructs a task-specific DAG as the search space, and Phase 2 generates CoT trajectories over this DAG under stochastic transition rules. This framework captures long-range dependencies and goal-directed reasoning while addressing limitations of prior graph-based models.
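To make the two-phase picture concrete, here is a minimal sketch of how such a process might be simulated. This is not the paper's implementation: the example DAG, node names, and the uniform choice over outgoing edges are all illustrative assumptions.

```python
import random

# Hypothetical derivation DAG (not from the paper): nodes are intermediate
# derivation states, edges are rule applications available from each state.
dag = {
    "givens": ["step1", "step2"],
    "step1": ["step3"],
    "step2": ["step3"],
    "step3": ["answer"],
    "answer": [],
}

def sample_trajectory(dag, start="givens", goal="answer", rng=random):
    """Phase-2 sketch: walk the DAG from the premises toward the goal,
    stochastically picking one applicable rule (outgoing edge) per step."""
    path = [start]
    node = start
    while node != goal:
        successors = dag[node]
        if not successors:  # dead end: no rule applies from this state
            return path
        node = rng.choice(successors)
        path.append(node)
    return path

print(sample_trajectory(dag))
```

In this toy graph every walk from the premises reaches the goal; real task DAGs would also contain dead ends and alternative derivations, which is what makes the trajectory distribution informative.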

10 retrieved papers
Logical closeness metric and perfect reasoning rate (PRR)

The authors introduce logical closeness to evaluate whether an LLM solves problems through rigorous logical inference rather than search. This yields a new evaluation metric called the perfect reasoning rate (PRR) and related AUC scores, distinguishing final-answer accuracy from rule-consistent derivation.

10 retrieved papers
Can Refute
DAG-MATH benchmark with 2,894 gold-standard problems

The authors propose the DAG-MATH format that makes the logical structure of CoT explicit through DAG representations. Using a three-stage prompting method, they construct a benchmark of 2,894 gold-standard DAGs from existing mathematical datasets, enabling systematic evaluation of reasoning fidelity.
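As an illustration of what a gold-standard record in such a benchmark might look like, here is a hypothetical sketch; the field names and schema below are invented for this example, not taken from DAG-MATH, and only the acyclicity check (Kahn's algorithm) is standard.

```python
import json

# Invented example record; the real DAG-MATH schema may differ.
record = json.loads("""
{
  "problem": "If x + 2 = 5, find x^2.",
  "nodes": {
    "n0": "x + 2 = 5",
    "n1": "x = 3",
    "n2": "x^2 = 9"
  },
  "edges": [["n0", "n1"], ["n1", "n2"]],
  "answer": "9"
}
""")

def is_acyclic(nodes, edges):
    """Validate the gold graph is a DAG using Kahn's algorithm."""
    indeg = {n: 0 for n in nodes}
    for _, dst in edges:
        indeg[dst] += 1
    frontier = [n for n, d in indeg.items() if d == 0]
    seen = 0
    while frontier:
        node = frontier.pop()
        seen += 1
        for src, dst in edges:
            if src == node:
                indeg[dst] -= 1
                if indeg[dst] == 0:
                    frontier.append(dst)
    return seen == len(nodes)  # all nodes ordered iff no cycle

print(is_acyclic(record["nodes"], record["edges"]))  # True
```

A validation step like this would be a natural gate before a model-generated trajectory is scored against the gold DAG.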

10 retrieved papers

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

DAG-based framework for modeling CoT as a rule-based stochastic process

The authors formalize Chain-of-Thought reasoning in two phases: Phase 1 constructs a task-specific DAG as the search space, and Phase 2 generates CoT trajectories over this DAG under stochastic transition rules. This framework captures long-range dependencies and goal-directed reasoning while addressing limitations of prior graph-based models.

Contribution

Logical closeness metric and perfect reasoning rate (PRR)

The authors introduce logical closeness to evaluate whether an LLM solves problems through rigorous logical inference rather than search. This yields a new evaluation metric called the perfect reasoning rate (PRR) and related AUC scores, distinguishing final-answer accuracy from rule-consistent derivation.

Contribution

DAG-MATH benchmark with 2,894 gold-standard problems

The authors propose the DAG-MATH format that makes the logical structure of CoT explicit through DAG representations. Using a three-stage prompting method, they construct a benchmark of 2,894 gold-standard DAGs from existing mathematical datasets, enabling systematic evaluation of reasoning fidelity.