CMPhysBench: A Benchmark for Evaluating Large Language Models in Condensed Matter Physics
Overview
Overall Novelty Assessment
The paper introduces CMPhysBench, a graduate-level benchmark for condensed matter physics comprising over 520 calculation problems, alongside the SEED metric for fine-grained evaluation. This work resides in the 'Graduate-Level Problem-Solving Benchmarks' leaf, which contains only two papers including this one. The sibling paper (CMT Benchmark) similarly targets condensed matter theory evaluation. This represents a relatively sparse research direction within the broader taxonomy, suggesting the specific focus on graduate-level condensed matter calculation problems addresses an underexplored niche.
The taxonomy reveals neighboring work in 'Multidisciplinary Scientific Benchmarks' (four papers) that evaluates LLMs across multiple scientific domains, including physics. The parent branch, 'Benchmark Development and Evaluation Frameworks', distinguishes domain-specific condensed matter benchmarks from broader scientific assessments. Related directions include 'Quantum Experiment and Simulation Design' and 'Predictive Modeling' under LLM Applications, which focus on applying models rather than systematically evaluating their capabilities. The scope notes clarify that this work's emphasis on structured evaluation separates it from application-driven systems.
Across the 26 candidates examined for the three contributions (10 for the CMPhysBench benchmark, 6 for the SEED metric, and 10 for the empirical evaluation), no clearly refuting prior work was identified. Within this limited search scope, the specific combination of graduate-level condensed matter problems, calculation-focused tasks, and tree-based expression evaluation appears relatively novel, though the analysis does not claim exhaustive coverage of all potentially relevant prior work.
Based on the top-26 semantic matches examined, the work appears to occupy a distinct position combining domain specialization (condensed matter physics), task design (open-ended calculations), and evaluation methodology (SEED metric). The sparse population of the taxonomy leaf and absence of refuting candidates within the search scope suggest meaningful differentiation from existing benchmarks, though the limited search scale means potentially relevant work outside these candidates remains unexamined.
Taxonomy
Research Landscape Overview
Claimed Contributions
The authors introduce CMPhysBench, a novel benchmark containing over 520 graduate-level questions in Condensed Matter Physics. Unlike existing multiple-choice benchmarks, it focuses exclusively on calculation problems requiring comprehensive solutions, covering six representative topics, including magnetism, superconductivity, and strongly correlated systems, with five distinct answer types.
The authors develop SEED, a novel evaluation metric that extends Expression Edit Distance by converting diverse answer types (expressions, equations, tuples, intervals, and numeric values) to abstract syntax trees and computing a tree edit distance between them. This provides fine-grained partial-credit scoring rather than binary accuracy, with physics-aware normalization for robust evaluation of mathematical expressions.
The authors conduct a systematic evaluation of 18 large language models on CMPhysBench, showing that even top-performing models achieve a SEED score of only 36 and 29% accuracy. Their analysis reveals significant capability gaps in condensed matter physics relative to general mathematical reasoning, and a detailed error categorization identifies concept misuse and logical errors as the primary failure modes.
Core Task Comparisons
Comparisons with papers in the same taxonomy category
[5] CMT-Benchmark: A Benchmark for Condensed Matter Theory Built by Expert Researchers
Contribution Analysis
Detailed comparisons for each claimed contribution
CMPhysBench: Graduate-level Condensed Matter Physics benchmark with open-ended calculation problems
The authors introduce CMPhysBench, a novel benchmark containing over 520 graduate-level questions in Condensed Matter Physics. Unlike existing multiple-choice benchmarks, it focuses exclusively on calculation problems requiring comprehensive solutions, covering six representative topics, including magnetism, superconductivity, and strongly correlated systems, with five distinct answer types. An illustrative item record is sketched after the candidate list below.
[47] Step by Step Calculation of the Penman-Monteith Evapotranspiration (FAO-56 Method)
[48] Marco-o1: Towards Open Reasoning Models for Open-Ended Solutions
[49] PhysReason: A Comprehensive Benchmark Towards Physics-Based Reasoning
[50] Can MLLMs Reason in Multimodality? EMMA: An Enhanced MultiModal ReAsoning Benchmark
[51] Scaling Physical Reasoning with the PHYSICS Dataset
[52] Hard2Verify: A Step-Level Verification Benchmark for Open-Ended Frontier Math
[53] C2STEM: A System for Synergistic Learning of Physics and Computational Thinking
[54] PHYSICS: Benchmarking Foundation Models on University-Level Physics Problem Solving
[55] ARB: Advanced Reasoning Benchmark for Large Language Models
[56] Theoretical Physics Benchmark (TPBench): A Dataset and Study of AI Reasoning Capabilities in Theoretical Physics
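As a rough illustration of what a calculation-only item with typed answers might look like, the following Python sketch defines a hypothetical item record. The field names (topic, question, answer, answer_type) and the five-type enumeration mirror the answer types named above, but they are illustrative assumptions, not the benchmark's published schema.

    # Hypothetical CMPhysBench-style item record; field names and the
    # answer-type enumeration are illustrative, not the released format.
    from dataclasses import dataclass
    from enum import Enum

    class AnswerType(Enum):
        EXPRESSION = "expression"  # e.g. "3*k_B*T/2"
        EQUATION = "equation"      # e.g. "Eq(E, p**2/(2*m))"
        TUPLE = "tuple"            # ordered collection of sub-answers
        INTERVAL = "interval"      # e.g. "Interval(0, T_c)"
        NUMERIC = "numeric"        # a plain number, possibly with units

    @dataclass
    class BenchmarkItem:
        topic: str                 # e.g. "magnetism", "superconductivity"
        question: str              # open-ended calculation problem statement
        answer: str                # gold answer as a parseable string
        answer_type: AnswerType

    item = BenchmarkItem(
        topic="magnetism",
        question="Derive the susceptibility of N independent two-state "
                 "magnetic moments mu at temperature T.",
        answer="N*mu**2/(k_B*T)",
        answer_type=AnswerType.EXPRESSION,
    )

Storing the gold answer as a parseable string is what makes tree-based scoring like SEED possible downstream.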
Scalable Expression Edit Distance (SEED) metric for fine-grained evaluation
The authors develop SEED, a novel evaluation metric that extends Expression Edit Distance by converting diverse answer types (expressions, equations, tuples, intervals, and numeric values) to abstract syntax trees and computing a tree edit distance between them. This provides fine-grained partial-credit scoring rather than binary accuracy, with physics-aware normalization for robust evaluation of mathematical expressions. A minimal sketch of this recipe appears after the candidate list below.
[41] Efficient Feedback and Partial Credit Grading for Proof Blocks Problems
[42] The Effects of Students' Discussion in Mathematical Modelling
[43] Visualization of Solution Processes to Reach the Correct Answer in Online Math Tests
[44] … Selection Methods in Multidimensional Computerized Adaptive Testing Adopting Polytomously-Scored Items under Multidimensional Generalized Partial Credit …
[45] An Investigation of Stratification Exposure Control Procedures in CATs Using the Generalized Partial Credit Model
[46] The Nature of Mathematical Modelling
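Since the report describes SEED only at a high level, here is a minimal sketch of the general recipe: parse both answers into expression trees with SymPy, compute a tree edit distance, and convert it into partial credit. This is an assumption-laden toy, not the paper's implementation: the helper names (node_label, ted, seed_score) are invented, children are matched positionally rather than with a full tree-edit-distance algorithm such as Zhang-Shasha, and the physics-aware normalization is reduced to sympy.simplify.

    # Toy SEED-style scorer: parse, compare expression trees, give partial
    # credit. A simplification of the paper's metric, not a reproduction.
    import sympy as sp

    def tree_size(e):
        """Count the nodes in a SymPy expression tree."""
        return 1 + sum(tree_size(a) for a in e.args)

    def node_label(e):
        """Label leaves (symbols, numbers) by value, internal nodes by operator."""
        return str(e) if not e.args else e.func.__name__

    def ted(a, b):
        """Unit-cost edit distance over trees, matching children positionally
        (a simplification of a full algorithm such as Zhang-Shasha)."""
        if a is None:
            return 0 if b is None else tree_size(b)
        if b is None:
            return tree_size(a)
        cost = 0 if node_label(a) == node_label(b) else 1
        ka, kb = list(a.args), list(b.args)
        ka += [None] * (len(kb) - len(ka))  # pad the shorter child list
        kb += [None] * (len(ka) - len(kb))
        return cost + sum(ted(x, y) for x, y in zip(ka, kb))

    def seed_score(pred: str, gold: str) -> float:
        """Partial credit in [0, 1]: one minus the normalized edit distance
        between the simplified prediction and gold expressions."""
        p = sp.simplify(sp.sympify(pred))
        g = sp.simplify(sp.sympify(gold))
        return max(0.0, 1.0 - ted(p, g) / max(tree_size(p), tree_size(g)))

    print(seed_score("k_B*T*3/2", "3*k_B*T/2"))  # 1.0: algebraically identical
    print(seed_score("k_B*T/2", "3*k_B*T/2"))    # partial credit, < 1.0

The point of the tree-based formulation is visible in the second call: a nearly correct coefficient loses some credit instead of all of it, which is exactly the fine-grained behavior that binary accuracy cannot provide.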
Comprehensive empirical evaluation revealing performance gaps in domain-specific physics reasoning
The authors conduct a systematic evaluation of 18 large language models on CMPhysBench, showing that even top-performing models achieve a SEED score of only 36 and 29% accuracy. Their analysis reveals significant capability gaps in condensed matter physics relative to general mathematical reasoning, and a detailed error categorization identifies concept misuse and logical errors as the primary failure modes.
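The report does not say how the accuracy figure relates to SEED; one plausible reading, assumed here for illustration only, is that an answer counts as correct exactly when it earns full SEED credit. Under that assumption, aggregation over a benchmark run could look like the following, continuing the scorer sketched earlier:

    # Hypothetical aggregation: mean SEED on a 0-100 scale, and accuracy as
    # the fraction of items with full credit (an assumed definition, not
    # confirmed by the source).
    def aggregate(scores: list[float]) -> dict:
        n = len(scores)
        return {
            "seed": 100 * sum(scores) / n,
            "accuracy": 100 * sum(s >= 1.0 for s in scores) / n,
        }

    print(aggregate([1.0, 0.75, 0.0, 1.0]))  # {'seed': 68.75, 'accuracy': 50.0}

This makes concrete why the two headline numbers can differ: partial credit raises the mean SEED above the exact-match rate whenever models get close to the gold answer without matching it, consistent with the reported 36 SEED versus 29% accuracy.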