PRISM-Physics: Causal DAG-Based Process Evaluation for Physics Reasoning

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: Physics Reasoning, Process-Level Evaluation, Symbolic Equivalence, Scientific Problem Solving
Abstract:

Benchmarks for competition-style reasoning have advanced evaluation in mathematics and programming, yet physics remains comparatively underexplored. Most existing physics benchmarks evaluate only final answers and thus fail to capture reasoning processes, while recent stepwise methods rely on heuristic LLM-as-judge scoring or restrictive linear assumptions, limiting reliability and diagnostic validity. We introduce PRISM-Physics, a process-level evaluation framework and benchmark for complex physics reasoning problems. Solutions are represented as directed acyclic graphs (DAGs) of formulas, explicitly encoding causal dependencies among intermediate steps to enable fine-grained, interpretable, and theoretically grounded scoring. We prove the optimality of the DAG representation and the corresponding scoring policy. Combined with a fully rule-based method for symbolic formula equivalence matching that we developed, the framework ensures consistent validation across diverse formulations without heuristic judgments. Results show that our evaluation framework is more aligned with human experts' scoring. Experiments on state-of-the-art LLMs reveal persistent reasoning failures in physics, while step-level scoring offers both diagnostic insight and rich signals for later training. By combining structural rigor, theoretical guarantees, and symbolic validation, PRISM-Physics provides a principled foundation for advancing process-level evaluation and guiding the development of models with deeper scientific reasoning capabilities.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces PRISM-Physics, a process-level evaluation framework for physics reasoning that represents solutions as directed acyclic graphs of formulas with explicit causal dependencies. Within the taxonomy, it occupies the 'Process-Level Physics Reasoning Evaluation with Causal DAGs' leaf, which contains only two papers total including this work. This represents a relatively sparse research direction compared to neighboring areas like Bayesian student assessment (four papers) or causal discovery methods (three papers), suggesting the paper addresses an emerging rather than saturated problem space.

The taxonomy reveals three major branches using DAG structures: causal inference from data, general reasoning frameworks, and educational assessment. This work bridges the educational assessment branch with elements from reasoning frameworks, particularly in its emphasis on interpretable step-by-step evaluation rather than latent skill inference. Neighboring leaves include Bayesian networks for student assessment, which focus on probabilistic skill modeling from response patterns, and problem complexity quantification using directed networks. The scope notes clarify that this leaf specifically targets process-level physics evaluation with causal DAGs, distinguishing it from both traditional Bayesian assessment and general problem-solving frameworks.

Among thirty candidates examined across three contributions, none were identified as clearly refuting the work. The PRISM-Physics benchmark contribution examined ten candidates with zero refutable matches, as did the DAG-based scoring policy with optimality guarantees and the rule-based symbolic equivalence checker. This suggests that within the limited search scope, the specific combination of physics-focused process evaluation, theoretically grounded DAG scoring, and rule-based formula matching appears relatively unexplored. However, the small candidate pool means the analysis captures top semantic matches rather than exhaustive prior work coverage.

Based on the limited literature search of thirty semantically similar papers, the work appears to occupy a distinct position combining process-level physics evaluation with formal DAG representations and symbolic reasoning. The sparse taxonomy leaf and absence of refuting candidates suggest novelty, though the restricted search scope means potentially relevant work in adjacent areas like automated theorem proving or symbolic mathematics may not be fully represented in this analysis.

Taxonomy

Core-task Taxonomy Papers: 18
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 0

Research Landscape Overview

Core task: process-level evaluation of physics reasoning using directed acyclic graphs. The field structure suggested by the taxonomy reveals three main branches that leverage DAG representations in distinct ways. The first branch, Causal Inference and Discovery Using DAG Structures, encompasses methods that learn or exploit causal relationships from data, spanning applications from geotechnical systems (Geotechnical Causal Discovery[8]) to physical activity modeling (DAGs Physical Activity[4]) and experience sampling (Experience Sampling Causal[1]). The second branch, DAG-Based Reasoning and Problem-Solving Frameworks, focuses on computational frameworks that use DAGs to structure problem-solving processes, including hierarchical synthesis approaches (Hierarchical GFlowNet Synthesis[7]) and theoretical investigations of length generalization (Length Generalization Theory[6], Length Generalization Conditions[10]). The third branch, Educational Assessment and Cognitive Modeling Using DAGs, applies DAG-based representations to evaluate student understanding and cognitive processes, with foundational work in Bayesian networks for assessment (Bayesian Nets Assessment[13], Bayesian Cognitive Assessment[12]) and more recent online physics evaluation systems (Online Physics Assessment[11]).

A particularly active line of work within educational assessment explores how DAG structures can capture the granular steps of reasoning rather than merely final outcomes. PRISM Physics[0] sits squarely in this process-level evaluation cluster, emphasizing causal DAGs to model intermediate reasoning stages in physics problem-solving. This contrasts with earlier Bayesian assessment frameworks (Bayesian Cognitive Assessment[12]) that primarily inferred latent skills from response patterns, and with online systems (Online Physics Assessment[11]) that may focus more on automated grading than on exposing the causal dependencies among reasoning steps. Nearby work on problem complexity networks (Problem Complexity Networks[3]) also examines structural representations of problem-solving, though with different emphases on complexity metrics. The main open question across these branches is how to balance the richness of process-level models with the practical demands of scalable, interpretable assessment in real educational settings.

Claimed Contributions

PRISM-Physics benchmark with DAG-structured solutions

The authors construct a large-scale benchmark of competition-level physics problems where each solution is systematically converted into a directed acyclic graph (DAG) structure. This DAG representation explicitly encodes causal dependencies among formulas, enabling fine-grained and interpretable process-level evaluation beyond traditional final-answer scoring.
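
To make the DAG representation concrete, here is a minimal Python sketch. It is an illustration only, not the authors' data format: the projectile-range problem, the formula identifiers (f1 to f4), and the `parents` mapping are all hypothetical. The only idea taken from the description above is that formulas are nodes and derivation dependencies are edges. The standard-library `graphlib` module is used to check that the dependencies really form a DAG.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical example (not from the benchmark): range of a projectile.
formulas = {
    "f1": "v_x = v_0 * cos(theta)",
    "f2": "v_y = v_0 * sin(theta)",
    "f3": "t = 2 * v_y / g",
    "f4": "R = v_x * t",
}

# Each formula id maps to the set of formulas it is derived from (its parents).
parents = {
    "f1": set(),
    "f2": set(),
    "f3": {"f2"},
    "f4": {"f1", "f3"},
}


def derivation_order(parents):
    """Return formula ids with every parent listed before its children.

    Raises ValueError if the dependencies contain a cycle, i.e. the
    structure is not a valid DAG.
    """
    try:
        return list(TopologicalSorter(parents).static_order())
    except CycleError as exc:
        raise ValueError("dependencies are not acyclic") from exc


if __name__ == "__main__":
    for fid in derivation_order(parents):
        print(fid, formulas[fid])
```

In this sketch the final answer f4 depends on f1 directly and on f2 through f3, which is the kind of non-linear dependency that a purely sequential list of steps cannot express.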

10 retrieved papers

DAG-based scoring policy with theoretical optimality guarantees

The authors develop an Ancestor Closure Scoring Policy that propagates credit along causal chains in the solution DAG. They prove this policy is optimal under a justification system framework, showing it minimizes evaluation ambiguity and aligns naturally with the logical structure of physics derivations.
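
One plausible reading of credit propagation along causal chains can be sketched in Python as follows. This is an assumption-laden illustration, not the paper's definition: it assumes that a matched formula earns credit for itself and for every formula it depends on, and that the score is the credited fraction of reference formulas. The `parents` graph, the function names, and the normalization are all hypothetical.

```python
# Hypothetical solution DAG: formula id -> ids of the formulas it depends on.
parents = {
    "f1": set(),
    "f2": set(),
    "f3": {"f2"},
    "f4": {"f1", "f3"},
}


def ancestor_closure(parents, matched):
    """Return the matched formulas together with all of their ancestors."""
    closure = set()
    # Ignore ids that are not part of the reference solution.
    stack = [node for node in matched if node in parents]
    while stack:
        node = stack.pop()
        if node not in closure:
            closure.add(node)
            stack.extend(parents[node])
    return closure


def process_score(parents, matched):
    """Fraction of reference formulas credited (assumed normalization)."""
    if not parents:
        return 0.0
    return len(ancestor_closure(parents, matched)) / len(parents)


if __name__ == "__main__":
    print(process_score(parents, {"f3"}))  # credits f3 and its ancestor f2
    print(process_score(parents, {"f4"}))  # credits the whole derivation
```

Under these assumptions, a response that matches only f3 is credited for f3 and f2 and scores 0.5, while a response that matches the final formula f4 is credited for the entire derivation and scores 1.0.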

10 retrieved papers

Rule-based symbolic formula equivalence checker

The authors introduce a two-stage algorithm for physics formula equivalence checking that handles constant substitutions, unit conversions, and equation equivalence through solution set comparison. This rule-based approach provides robust validation without depending on heuristic LLM-based judgments.
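
The solution-set comparison idea can be illustrated with a short Python sketch. It does not reproduce the authors' two-stage algorithm, whose details are not given here, and it omits unit conversion entirely. It assumes the third-party SymPy library, and the helper names (`solution_set`, `equivalent`) and the free-fall example are hypothetical. Two equations are treated as equivalent when, after substituting known constants, solving each for the same variable yields the same set of solutions.

```python
import sympy as sp


def solution_set(equation, variable):
    """Solve `equation` for `variable`; return the simplified solutions as a set."""
    return {sp.simplify(sol) for sol in sp.solve(equation, variable)}


def equivalent(eq_a, eq_b, variable, constants=None):
    """Compare two equations by their solution sets for `variable`.

    `constants` is an optional mapping of symbols to values that is
    substituted into both equations before solving.
    """
    constants = constants or {}
    return solution_set(eq_a.subs(constants), variable) == solution_set(
        eq_b.subs(constants), variable
    )


# Hypothetical example: speed after falling a height h from rest.
v, g, h = sp.symbols("v g h", positive=True)
reference = sp.Eq(v**2, 2 * g * h)
rearranged = sp.Eq(h, v**2 / (2 * g))          # same physics, different form
numeric = sp.Eq(h, v**2 / sp.Rational(98, 5))  # g = 9.8 already substituted
wrong = sp.Eq(h, v**2 / g)                     # missing factor of 2

if __name__ == "__main__":
    print(equivalent(reference, rearranged, h))
    print(equivalent(reference, numeric, h, {g: sp.Rational(49, 5)}))
    print(equivalent(reference, wrong, h))
```

Exact rationals are used for the constant so that the comparison stays symbolic; with floating-point constants a tolerance-based check would be needed instead.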

10 retrieved papers
