CMPhysBench: A Benchmark for Evaluating Large Language Models in Condensed Matter Physics

ICLR 2026 Conference Submission, Anonymous Authors
Keywords: LLM Benchmark, Condensed Matter Physics, LLM Evaluation, AI for Physics
Abstract:

We introduce CMPhysBench, a novel benchmark designed to assess the proficiency of Large Language Models (LLMs) in Condensed Matter Physics. CMPhysBench comprises more than 520 meticulously curated graduate-level questions covering both representative subfields and foundational theoretical frameworks of condensed matter physics, such as magnetism, superconductivity, and strongly correlated systems. To ensure a deep understanding of the problem-solving process, we focus exclusively on calculation problems, requiring LLMs to independently generate comprehensive solutions. Meanwhile, leveraging tree-based representations of expressions, we introduce the Scalable Expression Edit Distance (SEED) score, which provides fine-grained (non-binary) partial credit and yields a more accurate assessment of the similarity between a prediction and the ground truth. Our results show that even the best model, Grok-4, reaches only an average SEED score of 36 and 29% accuracy on CMPhysBench, underscoring a significant capability gap, especially in this practical, frontier domain relative to traditional physics.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces CMPhysBench, a graduate-level benchmark for condensed matter physics comprising over 520 calculation problems, alongside the SEED metric for fine-grained evaluation. This work resides in the 'Graduate-Level Problem-Solving Benchmarks' leaf, which contains only two papers including this one. The sibling paper (CMT Benchmark) similarly targets condensed matter theory evaluation. This represents a relatively sparse research direction within the broader taxonomy, suggesting the specific focus on graduate-level condensed matter calculation problems addresses an underexplored niche.

The taxonomy reveals neighboring work in 'Multidisciplinary Scientific Benchmarks' (four papers) that evaluate LLMs across multiple scientific domains including physics. The parent branch 'Benchmark Development and Evaluation Frameworks' distinguishes domain-specific condensed matter benchmarks from broader scientific assessments. Related directions include 'Quantum Experiment and Simulation Design' and 'Predictive Modeling' under LLM Applications, which focus on applying models rather than systematically evaluating their capabilities. The scope notes clarify that this work's emphasis on structured evaluation separates it from application-driven systems.

Among 26 candidates examined across three contributions, no clearly refuting prior work was identified. The CMPhysBench benchmark contribution was compared against 10 candidates with zero refutations, the SEED metric against 6 candidates with zero refutations, and the empirical evaluation against 10 candidates with zero refutations. Within this limited search scope, the specific combination of graduate-level condensed matter problems, calculation-focused tasks, and tree-based expression evaluation appears relatively novel, though the analysis does not claim exhaustive coverage of all potentially relevant prior work.

Based on the top-26 semantic matches examined, the work appears to occupy a distinct position combining domain specialization (condensed matter physics), task design (open-ended calculations), and evaluation methodology (SEED metric). The sparse population of the taxonomy leaf and absence of refuting candidates within the search scope suggest meaningful differentiation from existing benchmarks, though the limited search scale means potentially relevant work outside these candidates remains unexamined.

Taxonomy

Core-task Taxonomy Papers: 40
Claimed Contributions: 3
Contribution Candidate Papers Compared: 26
Refutable Papers: 0

Research Landscape Overview

Core task: Evaluating large language models in condensed matter physics. The field has organized itself around several complementary directions. Benchmark Development and Evaluation Frameworks focuses on creating rigorous test sets and metrics to assess LLM capabilities on physics problems, ranging from graduate-level problem-solving to specialized domain knowledge. LLM Applications in Materials Discovery and Design explores how models can accelerate the search for novel materials, predict properties, and guide experimental workflows. Domain Enhancement and Adaptation Strategies investigates methods to improve LLM performance through fine-tuning, retrieval augmentation, and physics-informed architectures. Empirical Analysis and Robustness Studies examines model reliability, failure modes, and generalization across different physics subdomains. Methodological Foundations and Cross-Domain Perspectives provides theoretical grounding and connections to broader AI research, while Specialized Physics Applications targets specific phenomena like superconductivity or topological materials. Universal Potentials and Simulation Acceleration develops neural network models for atomic-scale simulations.

Within Benchmark Development, a particularly active line of work centers on graduate-level problem-solving benchmarks that test deep conceptual understanding rather than simple factual recall. CMPhysBench[0] contributes to this effort by providing a curated set of condensed matter physics problems designed to probe reasoning capabilities. This work sits alongside CMT Benchmark[5], which similarly targets condensed matter theory evaluation, and complements broader physics benchmarks like Quantum Mechanics LLMs[3] and QMBench[28] that assess quantum physics understanding.

A key tension across these benchmarks involves balancing problem difficulty with diagnostic value: overly specialized questions may not reveal general reasoning patterns, while simpler tasks risk underestimating model limitations. The emphasis in CMPhysBench[0] on condensed matter specifically allows for deeper probing of domain expertise compared to more general physics assessments, though this specialization also raises questions about how performance translates across subfields.

Claimed Contributions

CMPhysBench: Graduate-level Condensed Matter Physics benchmark with open-ended calculation problems

The authors introduce CMPhysBench, a novel benchmark containing 520 graduate-level questions in Condensed Matter Physics. Unlike existing multiple-choice benchmarks, it focuses exclusively on calculation problems requiring comprehensive solutions, covering six representative topics, including magnetism, superconductivity, and strongly correlated systems, with five distinct answer types; an illustrative item layout is sketched after this block.

10 retrieved papers
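To make the task format concrete, the sketch below shows how a single benchmark item could be represented. The field names, topic label, and example problem are hypothetical illustrations, not the paper's actual schema; only the five answer types and the six-subfield coverage are taken from the description above.

```python
# Hypothetical CMPhysBench item layout. Field names, the topic label, and the
# example problem are illustrative assumptions, not the authors' schema.
example_item = {
    "id": "free-electron-007",                 # hypothetical identifier
    "topic": "metals / free electron theory",  # one of six subfields, e.g. magnetism, superconductivity
    "question": ("Derive the Fermi energy of a three-dimensional free electron "
                 "gas with electron number density n."),
    "answer_type": "expression",               # one of five types: expression, equation, tuple, interval, numeric
    "ground_truth": "hbar**2 * (3*pi**2*n)**(2/3) / (2*m)",
}
```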
Scalable Expression Edit Distance (SEED) metric for fine-grained evaluation

The authors develop SEED, a novel evaluation metric that extends Expression Edit Distance by converting diverse answer types (expressions, equations, tuples, intervals, and numeric values) to abstract syntax trees and computing a tree-edit distance. This provides fine-grained partial-credit scoring rather than binary accuracy, with physics-aware normalization for robust evaluation of mathematical expressions; a minimal sketch of this idea follows below.

6 retrieved papers
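The following minimal Python sketch illustrates the idea behind SEED under stated assumptions: answers are parsed into SymPy expression trees, and a crude tree dissimilarity is mapped to a 0-100 partial-credit score. The function names, the simplified distance routine, and the normalization are assumptions made for illustration; the authors' metric additionally handles equations, tuples, intervals, and numeric answers and applies physics-aware normalization.

```python
# Minimal sketch of a SEED-style partial-credit score using SymPy expression
# trees. Illustrative only; not the authors' implementation.
import sympy as sp
from sympy.parsing.sympy_parser import parse_expr

def to_tree(expr):
    """Convert a SymPy expression into a (label, children) tuple tree."""
    if expr.args:
        return (type(expr).__name__, [to_tree(a) for a in expr.args])
    return (str(expr), [])

def tree_size(t):
    return 1 + sum(tree_size(c) for c in t[1])

def tree_distance(a, b):
    """Crude edit-distance proxy: relabel cost plus pairwise child alignment.
    A real implementation would use a proper tree-edit algorithm (e.g. Zhang-Shasha)."""
    cost = 0 if a[0] == b[0] else 1
    ca, cb = a[1], b[1]
    for x, y in zip(ca, cb):
        cost += tree_distance(x, y)
    # Children without a counterpart count as whole-subtree insertions/deletions.
    for extra in ca[len(cb):] + cb[len(ca):]:
        cost += tree_size(extra)
    return cost

def seed_score(pred: str, gold: str) -> float:
    """Map tree dissimilarity between prediction and ground truth to 0-100."""
    tp = to_tree(sp.simplify(parse_expr(pred)))
    tg = to_tree(sp.simplify(parse_expr(gold)))
    dist = tree_distance(tp, tg)
    return 100.0 * max(0.0, 1.0 - dist / max(tree_size(tp), tree_size(tg)))

print(seed_score("3*k*T/2", "3*k*T/2"))  # 100.0 -- exact match
print(seed_score("k*T", "3*k*T/2"))      # > 0   -- partial credit for a near-miss
```

The second call shows the key property claimed for SEED: a near-miss such as dropping a prefactor still earns graded credit instead of a flat zero.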
Comprehensive empirical evaluation revealing performance gaps in domain-specific physics reasoning

The authors conduct a systematic evaluation of 18 large language models on CMPhysBench, demonstrating that even top-performing models achieve only a SEED score of 36 and 29% accuracy. Their analysis reveals significant capability gaps in condensed matter physics compared to general mathematical reasoning, with detailed error categorization identifying concept misuse and logical errors as the primary failure modes; a toy aggregation of these headline numbers is sketched below.

10 retrieved papers
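As a toy illustration of how the two headline numbers relate, the snippet below aggregates per-item results into an average SEED score and an exact-match accuracy. The data and the aggregation function are invented for illustration, not drawn from the paper.

```python
# Toy aggregation of per-item results into the two headline numbers (average
# SEED and exact-match accuracy). The numbers below are invented examples.
def aggregate(results):
    """results: list of (seed_score_0_to_100, exactly_correct) pairs, one per item."""
    avg_seed = sum(s for s, _ in results) / len(results)
    accuracy = 100.0 * sum(1 for _, ok in results if ok) / len(results)
    return avg_seed, accuracy

toy = [(100.0, True), (40.0, False), (25.0, False), (0.0, False)]
print(aggregate(toy))  # (41.25, 25.0): partial credit exceeds exact-match accuracy
```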

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

CMPhysBench: Graduate-level Condensed Matter Physics benchmark with open-ended calculation problems

The authors introduce CMPhysBench, a novel benchmark containing 520 graduate-level questions in Condensed Matter Physics. Unlike existing multiple-choice benchmarks, it focuses exclusively on calculation problems requiring comprehensive solutions, covering six representative topics, including magnetism, superconductivity, and strongly correlated systems, with five distinct answer types.

Contribution

Scalable Expression Edit Distance (SEED) metric for fine-grained evaluation

The authors develop SEED, a novel evaluation metric that extends Expression Edit Distance by converting diverse answer types (expressions, equations, tuples, intervals, and numeric values) to abstract syntax trees and computing a tree-edit distance. This provides fine-grained partial-credit scoring rather than binary accuracy, with physics-aware normalization for robust evaluation of mathematical expressions.

Contribution

Comprehensive empirical evaluation revealing performance gaps in domain-specific physics reasoning

The authors conduct a systematic evaluation of 18 large language models on CMPhysBench, demonstrating that even top-performing models achieve only a SEED score of 36 and 29% accuracy. Their analysis reveals significant capability gaps in condensed matter physics compared to general mathematical reasoning, with detailed error categorization identifying concept misuse and logical errors as the primary failure modes.