PRISM-Physics: Causal DAG-Based Process Evaluation for Physics Reasoning

ICLR 2026 Conference Submission | Anonymous Authors

OpenReview Score: 5.5

Keywords: Physics Reasoning, Process-Level Evaluation, Symbolic Equivalence, Scientific Problem Solving

Abstract:

Benchmarks for competition-style reasoning have advanced evaluation in mathematics and programming, yet physics remains comparatively underexplored. Most existing physics benchmarks evaluate only final answers, which fail to capture reasoning processes, while recent stepwise methods rely on heuristic LLM-as-judge scoring or restrictive linear assumptions, limiting reliability and diagnostic validity. We introduce PRISM-Physics, a process-level evaluation framework and benchmark for complex physics reasoning problems. Solutions are represented as directed acyclic graphs (DAGs) of formulas, explicitly encoding causal dependencies among intermediate steps to enable fine-grained, interpretable, and theoretically grounded scoring. We prove the optimality of the DAG representation and the corresponding scoring policy. Combining with a fully rule-based method for symbolic formula equivalence matching that we developed, we ensure consistent validation across diverse formulations without heuristic judgments. Results show that our evaluation framework is more aligned with human experts' scoring. Experiments on state-of-the-art LLMs reveal persistent reasoning failures in physics, while step-level scoring offers both diagnostic insight and rich signals for later training. By combining structural rigor, theoretical guarantees, and symbolic validation, PRISM-Physics provides a principled foundation for advancing process-level evaluation and guiding the development of models with deeper scientific reasoning capabilities.

Disclaimer

This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.

NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.

If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces PRISM-Physics, a process-level evaluation framework for physics reasoning that represents solutions as directed acyclic graphs of formulas with explicit causal dependencies. Within the taxonomy, it occupies the 'Process-Level Physics Reasoning Evaluation with Causal DAGs' leaf, which contains only two papers, including this work. This is a relatively sparse research direction compared to neighboring areas such as Bayesian student assessment (four papers) or causal discovery methods (three papers), suggesting that the paper addresses an emerging rather than a saturated problem space.

The taxonomy reveals three major branches using DAG structures: causal inference from data, general reasoning frameworks, and educational assessment. This work bridges the educational assessment branch with elements from reasoning frameworks, particularly in its emphasis on interpretable step-by-step evaluation rather than latent skill inference. Neighboring leaves include Bayesian networks for student assessment, which focus on probabilistic skill modeling from response patterns, and problem complexity quantification using directed networks. The scope notes clarify that this leaf specifically targets process-level physics evaluation with causal DAGs, distinguishing it from both traditional Bayesian assessment and general problem-solving frameworks.

Among thirty candidates examined across three contributions, none were identified as clearly refuting the work. The PRISM-Physics benchmark contribution examined ten candidates with zero refutable matches, as did the DAG-based scoring policy with optimality guarantees and the rule-based symbolic equivalence checker. This suggests that within the limited search scope, the specific combination of physics-focused process evaluation, theoretically grounded DAG scoring, and rule-based formula matching appears relatively unexplored. However, the small candidate pool means the analysis captures top semantic matches rather than exhaustive prior work coverage.

Based on the limited literature search of thirty semantically similar papers, the work appears to occupy a distinct position combining process-level physics evaluation with formal DAG representations and symbolic reasoning. The sparse taxonomy leaf and absence of refuting candidates suggest novelty, though the restricted search scope means potentially relevant work in adjacent areas like automated theorem proving or symbolic mathematics may not be fully represented in this analysis.

Taxonomy


Research Landscape Overview

Core task: process-level evaluation of physics reasoning using directed acyclic graphs.

The field structure suggested by the taxonomy reveals three main branches that leverage DAG representations in distinct ways. The first branch, Causal Inference and Discovery Using DAG Structures, encompasses methods that learn or exploit causal relationships from data, spanning applications from geotechnical systems (Geotechnical Causal Discovery[8]) to physical activity modeling (DAGs Physical Activity[4]) and experience sampling (Experience Sampling Causal[1]). The second branch, DAG-Based Reasoning and Problem-Solving Frameworks, focuses on computational frameworks that use DAGs to structure problem-solving processes, including hierarchical synthesis approaches (Hierarchical GFlowNet Synthesis[7]) and theoretical investigations of length generalization (Length Generalization Theory[6], Length Generalization Conditions[10]). The third branch, Educational Assessment and Cognitive Modeling Using DAGs, applies DAG-based representations to evaluate student understanding and cognitive processes, with foundational work in Bayesian networks for assessment (Bayesian Nets Assessment[13], Bayesian Cognitive Assessment[12]) and more recent online physics evaluation systems (Online Physics Assessment[11]).

A particularly active line of work within educational assessment explores how DAG structures can capture the granular steps of reasoning rather than merely final outcomes. PRISM Physics[0] sits squarely in this process-level evaluation cluster, emphasizing causal DAGs to model intermediate reasoning stages in physics problem-solving. This contrasts with earlier Bayesian assessment frameworks (Bayesian Cognitive Assessment[12]) that primarily inferred latent skills from response patterns, and with online systems (Online Physics Assessment[11]) that may focus more on automated grading than on exposing the causal dependencies among reasoning steps. Nearby work on problem complexity networks (Problem Complexity Networks[3]) also examines structural representations of problem-solving, though with different emphases on complexity metrics. The main open question across these branches is how to balance the richness of process-level models with the practical demands of scalable, interpretable assessment in real educational settings.

Claimed Contributions

PRISM-Physics benchmark with DAG-structured solutions

10 retrieved papers

The authors construct a large-scale benchmark of competition-level physics problems where each solution is systematically converted into a directed acyclic graph (DAG) structure. This DAG representation explicitly encodes causal dependencies among formulas, enabling fine-grained and interpretable process-level evaluation beyond traditional final-answer scoring.
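To make the DAG representation concrete, the following is a minimal sketch of how one solution might be stored and checked for acyclicity. It is not taken from the paper: the free-fall formulas, the node ids `f1` to `f4`, and the helper `derivation_order` are all hypothetical, and the only library used is Python's standard `graphlib`.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical formula nodes for a toy free-fall problem (not from the paper).
formulas = {
    "f1": "E_p = m*g*h",
    "f2": "E_k = m*v**2/2",
    "f3": "m*g*h = m*v**2/2",
    "f4": "v = sqrt(2*g*h)",
}

# Each node maps to the set of formulas it directly depends on.
depends_on = {
    "f1": set(),
    "f2": set(),
    "f3": {"f1", "f2"},
    "f4": {"f3"},
}


def derivation_order(depends_on):
    """Return the nodes with prerequisites first; reject cyclic graphs."""
    try:
        return list(TopologicalSorter(depends_on).static_order())
    except CycleError as exc:
        raise ValueError("solution graph is not a DAG") from exc


for node in derivation_order(depends_on):
    print(node, formulas[node])
```

In this sketch the final answer `f4` always appears last, because every formula is listed only after the formulas it depends on.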

DAG-based scoring policy with theoretical optimality guarantees

10 retrieved papers

The authors develop an Ancestor Closure Scoring Policy that propagates credit along causal chains in the solution DAG. They prove this policy is optimal under a justification system framework, showing it minimizes evaluation ambiguity and aligns naturally with the logical structure of physics derivations.
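One plausible reading of this policy can be sketched as follows. The sketch assumes that matching a formula also credits every formula it depends on, and that all nodes carry equal weight; both are assumptions made for illustration, and the paper's exact definition may differ. The function names and the toy graph are hypothetical.

```python
# Toy dependency graph (hypothetical): node -> formulas it directly depends on.
depends_on = {
    "f1": set(),
    "f2": set(),
    "f3": {"f1", "f2"},
    "f4": {"f3"},
}


def ancestor_closure(depends_on, matched):
    """Return the matched nodes together with all of their ancestors."""
    closure = set()
    stack = [node for node in matched if node in depends_on]
    while stack:
        node = stack.pop()
        if node in closure:
            continue
        closure.add(node)
        stack.extend(depends_on[node])
    return closure


def ancestor_closure_score(depends_on, matched):
    """Fraction of nodes credited (equal weights are an assumption)."""
    return len(ancestor_closure(depends_on, matched)) / len(depends_on)


print(ancestor_closure_score(depends_on, {"f3"}))  # 0.75
print(ancestor_closure_score(depends_on, {"f4"}))  # 1.0
```

Under these assumptions, a response that states only the energy balance `f3` is still credited for `f1` and `f2`, while a response that reaches the final formula `f4` receives full credit.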

Rule-based symbolic formula equivalence checker

10 retrieved papers

The authors introduce a two-stage algorithm for physics formula equivalence checking that handles constant substitutions, unit conversions, and equation equivalence through solution set comparison. This rule-based approach provides robust validation without depending on heuristic LLM-based judgments.
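The idea of comparing solution sets can be illustrated with a short sketch built on the third-party SymPy library. This is not the authors' two-stage algorithm: it covers only constant substitution and solution set comparison for a single target variable, omits unit conversion, and the helper `same_solution_set` and the example equations are hypothetical.

```python
import sympy as sp

v, g, h = sp.symbols("v g h", positive=True)


def same_solution_set(eq_a, eq_b, target, constants=None):
    """Check whether two equations give the same solutions for `target`.

    `constants` optionally maps symbols to fixed values that are
    substituted into both equations before solving.
    """
    constants = constants or {}
    sols_a = sp.solve(eq_a.subs(constants), target)
    sols_b = sp.solve(eq_b.subs(constants), target)
    if len(sols_a) != len(sols_b):
        return False
    unmatched = list(sols_b)
    for sol_a in sols_a:
        for sol_b in unmatched:
            if sp.simplify(sol_a - sol_b) == 0:
                unmatched.remove(sol_b)
                break
        else:
            return False  # no solution of eq_b matches sol_a
    return True


reference = sp.Eq(v**2, 2 * g * h)
rearranged = sp.Eq(h, v**2 / (2 * g))
wrong = sp.Eq(h, v**2 / g)
numeric = sp.Eq(h, 5 * v**2 / 98)  # the reference with g = 9.8 substituted

print(same_solution_set(reference, rearranged, h))  # True
print(same_solution_set(reference, wrong, h))       # False
print(same_solution_set(reference, numeric, h, {g: sp.Rational(49, 5)}))  # True
```

Solving both equations for the same variable makes the comparison insensitive to how each equation was rearranged, which is the property the contribution description emphasizes.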

Core Task Comparisons

Comparisons with papers in the same taxonomy category

[17] Olae: A Bayesian Performance Assessment for Complex Problem Solving.

Kurt VanLehn (2001)

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution: PRISM-Physics benchmark with DAG-structured solutions

[29] LLaVA-CoT: Let Vision Language Models Reason Step-by-Step (Cannot Refute)

[30] OlympiadBench: A Challenging Benchmark for Promoting AGI with Olympiad-Level Bilingual Multimodal Scientific Problems (Cannot Refute)

[31] Olympicarena: Benchmarking multi-discipline cognitive reasoning for superintelligent ai (Cannot Refute)

[32] PhysicsArena: The First Multimodal Physics Reasoning Benchmark Exploring Variable, Process, and Solution Dimensions (Cannot Refute)

[33] Street: A multi-task structured reasoning and explanation benchmark (Cannot Refute)

[34] Seeing is Not Reasoning: MVPBench for Graph-based Evaluation of Multi-path Visual Physical CoT (Cannot Refute)

[35] PhysVLM-AVR: Active Visual Reasoning for Multimodal Large Language Models in Physical Environments (Cannot Refute)

[36] SymPyBench: A Dynamic Benchmark for Scientific Reasoning with Executable Python Code (Cannot Refute)

[37] MASS: Motion-Aware Spatial-Temporal Grounding for Physics Reasoning and Comprehension in Vision-Language Models (Cannot Refute)

[38] Reasoning With a Star: A Heliophysics Dataset and Benchmark for Agentic Scientific Reasoning (Cannot Refute)

Contribution: DAG-based scoring policy with theoretical optimality guarantees

[19] Evadrive: Evolutionary adversarial policy optimization for end-to-end autonomous driving (Cannot Refute)

[20] Causal discovery and counterfactual reasoning to optimize persuasive dialogue policies (Cannot Refute)

[21] Dice: Dynamic in-context example selection in llm agents via efficient knowledge transfer (Cannot Refute)

[22] Causal Reinforcement Learning for Knowledge Graph Reasoning (Cannot Refute)

[23] Functional Causal Bayesian Optimization (Cannot Refute)

[24] Contextual Multi-Armed Bandits for Causal Marketing (Cannot Refute)

[25] Causal Confusion Reduction for Robust Multi-Domain Dialogue Policy (Cannot Refute)

[26] Envision: Benchmarking Unified Understanding&Generation for Causal World Process Insights (Cannot Refute)

[27] DeCoRL: Decoupling Reasoning Chains via Parallel Sub-Step Generation and Cascaded Reinforcement for Interpretable and Scalable RLHF (Cannot Refute)

[28] Learning to Optimize Feedback for One Million Students: Insights from Multi-Armed and Contextual Bandits in Large-Scale Online Tutoring (Cannot Refute)

Contribution: Rule-based symbolic formula equivalence checker

[39] Automated verification of weak equivalence within the SMODELS system (Cannot Refute)

[40] Autoformalize mathematical statements by symbolic equivalence and semantic consistency (Cannot Refute)

[41] Learning continuous semantic representations of symbolic expressions (Cannot Refute)

[42] EquivaMap: Leveraging LLMs for Automatic Equivalence Checking of Optimization Formulations (Cannot Refute)

[43] From Bounded Checking to Verification of Equivalence via Symbolic Up-to Techniques (Cannot Refute)

[44] Closed-Form Interpretation of Neural Network Latent Spaces with Symbolic Gradients (Cannot Refute)

[45] The metric is the message: benchmarking challenges for neural symbolic regression (Cannot Refute)

[46] Understanding equivalence of symbolic expressions in a spreadsheet-based environment (Cannot Refute)

[47] Automation of mathematics examinations (Cannot Refute)

[48] Complexity of mathematical expressions and its application in automatic answer checking (Cannot Refute)
