DAG-Math: Graph-Guided Mathematical Reasoning in LLMs

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: LLMs, mathematical reasoning, directed acyclic graphs
Abstract:

Large Language Models (LLMs) demonstrate strong performance on mathematical problems when prompted with Chain-of-Thought (CoT), yet it remains unclear whether this success stems from search, rote procedures, or rule-consistent reasoning. To address this, we propose modeling CoT as a rule-based stochastic process over directed acyclic graphs (DAGs), where nodes represent intermediate derivation states and edges encode rule applications. Within this framework, we introduce logical closeness, a metric that quantifies how well a model's CoT trajectory (i.e., the LLM's output) adheres to the DAG structure, providing evaluation beyond the classical pass@k metric. Building on this, we introduce the DAG-MATH CoT format and construct a benchmark that guides LLMs to generate CoT trajectories in this format, enabling evaluation of their reasoning ability under our framework. Across standard mathematical reasoning datasets, our analysis uncovers statistically significant differences in reasoning fidelity among representative LLM families, even when pass@k is comparable, highlighting gaps between final-answer accuracy and rule-consistent derivation. Our framework strikes a balance between free-form CoT and formal proof systems, offering actionable diagnostics for LLM reasoning evaluation.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes modeling Chain-of-Thought reasoning as a rule-based stochastic process over directed acyclic graphs, introducing a 'logical closeness' metric to evaluate adherence to derivation rules beyond final-answer accuracy. It resides in the 'Rule-Based Reasoning Fidelity Metrics' leaf, which contains only two papers in total (this one and DAG-Think-Twice). This represents a relatively sparse research direction within the broader taxonomy of 28 papers, suggesting that the specific focus on rule-fidelity metrics for mathematical reasoning is an emerging rather than crowded area.

The taxonomy reveals neighboring evaluation approaches in sibling leaves: 'Dynamic Benchmark Generation via DAGs' focuses on test sample creation to avoid contamination, 'Step-Level Reasoning Verification' validates individual reasoning steps, and 'Semantic Structure Analysis' parses traces into DAGs for pattern characterization. The parent category 'DAG-Based Evaluation and Verification Methods' encompasses five distinct evaluation philosophies, while adjacent branches like 'DAG-Based Reasoning Frameworks' and 'DAG-Guided Data Synthesis' address generation rather than assessment. The scope_note clarifies this leaf specifically measures trajectory adherence to DAG-encoded rules, distinguishing it from semantic parsing or dynamic benchmarking approaches.

Among 30 candidates examined, the logical closeness metric (Contribution 2) shows one refutable candidate among the 10 examined, indicating that some prior work on fidelity measurement exists within this limited search scope. The DAG-based framework (Contribution 1) and DAG-MATH benchmark (Contribution 3) each examined 10 candidates with zero refutations, suggesting these contributions may occupy less-explored territory among the top-30 semantic matches. The statistics indicate modest overlap for the metric component, while the framework and benchmark appear more distinctive within this constrained candidate pool, though the search scope does not cover the entire field exhaustively.

Based on the limited top-30 semantic search, the work appears to introduce a relatively novel evaluation perspective within a sparse taxonomy leaf, though one contribution shows measurable prior overlap. The analysis covers semantic neighbors and citation-expanded candidates but does not constitute comprehensive field coverage, leaving open the possibility of additional related work beyond the examined scope.

Taxonomy

Core-task Taxonomy Papers: 28
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 1

Research Landscape Overview

Core task: Evaluating mathematical reasoning in large language models using directed acyclic graphs. The field has organized itself around several complementary branches that collectively address how DAG structures can enhance both the generation and assessment of mathematical reasoning. DAG-Based Reasoning Frameworks and Architectures develop novel computational structures that explicitly model reasoning steps as graph nodes, enabling more transparent multi-step inference. DAG-Based Evaluation and Verification Methods focus on measuring reasoning fidelity and correctness by leveraging dependency structures inherent in mathematical proofs. DAG-Guided Data Synthesis and Training explores how graph-based templates can generate high-quality training data that respects logical dependencies. Meanwhile, Causal Reasoning and DAG-Based Causal Inference apply causal graph theory to understand and improve model behavior, and Graph-Augmented Reasoning and Retrieval integrates external knowledge graphs to support complex problem-solving. Premise-Based and Dependency-Aware Reasoning emphasizes tracking logical prerequisites, while Theoretical Foundations of Length Generalization investigates the conditions under which models can extend reasoning to longer chains.

Several active lines of work reveal key trade-offs between structural expressiveness and computational tractability. DAG-Math[3] and related evaluation frameworks emphasize rigorous verification of reasoning paths, contrasting with data synthesis approaches that prioritize scalable generation of diverse problem instances. The original paper[0] situates itself within the Rule-Based Reasoning Fidelity Metrics cluster, closely aligned with DAG-Think-Twice[28], both focusing on measuring how faithfully models adhere to prescribed logical rules during multi-step reasoning. This contrasts with broader graph-augmented methods like Graph Chain-of-Thought[9] or ReasoningFlow[12], which integrate external knowledge but may sacrifice fine-grained fidelity assessment. A central open question across these branches is how to balance the interpretability gains from explicit DAG structures against the flexibility needed for open-ended mathematical discovery, with ongoing work exploring hybrid architectures that combine symbolic verification with neural adaptability.

Claimed Contributions

DAG-based framework for modeling CoT as a rule-based stochastic process

The authors formalize Chain-of-Thought reasoning in two phases: Phase 1 constructs a task-specific DAG as the search space, and Phase 2 generates CoT trajectories over this DAG under stochastic transition rules. This framework captures long-range dependencies and goal-directed reasoning while addressing limitations of prior graph-based models.
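To make the two-phase picture concrete, here is a minimal sketch of how such a process might be simulated. This is not the paper's implementation: the example DAG, node names, and the uniform choice over outgoing edges are all illustrative assumptions.

```python
import random

# Hypothetical derivation DAG (not from the paper): nodes are intermediate
# derivation states, edges are rule applications available from each state.
dag = {
    "givens": ["step1", "step2"],
    "step1": ["step3"],
    "step2": ["step3"],
    "step3": ["answer"],
    "answer": [],
}

def sample_trajectory(dag, start="givens", goal="answer", rng=random):
    """Phase-2 sketch: walk the DAG from the premises toward the goal,
    stochastically picking one applicable rule (outgoing edge) per step."""
    path = [start]
    node = start
    while node != goal:
        successors = dag[node]
        if not successors:  # dead end: no rule applies from this state
            return path
        node = rng.choice(successors)
        path.append(node)
    return path

print(sample_trajectory(dag))
```

In this toy graph every walk from the premises reaches the goal; real task DAGs would also contain dead ends and alternative derivations, which is what makes the trajectory distribution informative.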

10 retrieved papers
Logical closeness metric and perfect reasoning rate (PRR)

The authors introduce logical closeness to evaluate whether an LLM solves problems through rigorous logical inference rather than search. This yields a new evaluation metric called the perfect reasoning rate (PRR) and related AUC scores, distinguishing final-answer accuracy from rule-consistent derivation.

10 retrieved papers
Can Refute
DAG-MATH benchmark with 2,894 gold-standard problems

The authors propose the DAG-MATH format that makes the logical structure of CoT explicit through DAG representations. Using a three-stage prompting method, they construct a benchmark of 2,894 gold-standard DAGs from existing mathematical datasets, enabling systematic evaluation of reasoning fidelity.
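As an illustration of what a gold-standard record in such a benchmark might look like, here is a hypothetical sketch; the field names and schema below are invented for this example, not taken from DAG-MATH, and only the acyclicity check (Kahn's algorithm) is standard.

```python
import json

# Invented example record; the real DAG-MATH schema may differ.
record = json.loads("""
{
  "problem": "If x + 2 = 5, find x^2.",
  "nodes": {
    "n0": "x + 2 = 5",
    "n1": "x = 3",
    "n2": "x^2 = 9"
  },
  "edges": [["n0", "n1"], ["n1", "n2"]],
  "answer": "9"
}
""")

def is_acyclic(nodes, edges):
    """Validate the gold graph is a DAG using Kahn's algorithm."""
    indeg = {n: 0 for n in nodes}
    for _, dst in edges:
        indeg[dst] += 1
    frontier = [n for n, d in indeg.items() if d == 0]
    seen = 0
    while frontier:
        node = frontier.pop()
        seen += 1
        for src, dst in edges:
            if src == node:
                indeg[dst] -= 1
                if indeg[dst] == 0:
                    frontier.append(dst)
    return seen == len(nodes)  # all nodes ordered iff no cycle

print(is_acyclic(record["nodes"], record["edges"]))  # True
```

A validation step like this would be a natural gate before a model-generated trajectory is scored against the gold DAG.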

10 retrieved papers

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

DAG-based framework for modeling CoT as a rule-based stochastic process

The authors formalize Chain-of-Thought reasoning in two phases: Phase 1 constructs a task-specific DAG as the search space, and Phase 2 generates CoT trajectories over this DAG under stochastic transition rules. This framework captures long-range dependencies and goal-directed reasoning while addressing limitations of prior graph-based models.

Contribution

Logical closeness metric and perfect reasoning rate (PRR)

The authors introduce logical closeness to evaluate whether an LLM solves problems through rigorous logical inference rather than search. This yields a new evaluation metric called the perfect reasoning rate (PRR) and related AUC scores, distinguishing final-answer accuracy from rule-consistent derivation.

Contribution

DAG-MATH benchmark with 2,894 gold-standard problems

The authors propose the DAG-MATH format that makes the logical structure of CoT explicit through DAG representations. Using a three-stage prompting method, they construct a benchmark of 2,894 gold-standard DAGs from existing mathematical datasets, enabling systematic evaluation of reasoning fidelity.