CMPhysBench: A Benchmark for Evaluating Large Language Models in Condensed Matter Physics

ICLR 2026 Conference Submission, Anonymous Authors
Keywords: LLM Benchmark, Condensed Matter Physics, LLM Evaluation, AI for Physics
Abstract:

We introduce CMPhysBench, a novel benchmark designed to assess the proficiency of Large Language Models (LLMs) in Condensed Matter Physics. CMPhysBench comprises more than 520 meticulously curated graduate-level questions covering both representative subfields and foundational theoretical frameworks of condensed matter physics, such as magnetism, superconductivity, and strongly correlated systems. To ensure a deep understanding of the problem-solving process, we focus exclusively on calculation problems, requiring LLMs to independently generate comprehensive solutions. Meanwhile, leveraging tree-based representations of expressions, we introduce the Scalable Expression Edit Distance (SEED) score, which provides fine-grained (non-binary) partial credit and yields a more accurate assessment of the similarity between a prediction and the ground truth. Our results show that even the best model, Grok-4, reaches only an average SEED score of 36 and 29% accuracy on CMPhysBench, underscoring a significant capability gap, especially in this practical, frontier domain relative to traditional physics.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces CMPhysBench, a graduate-level benchmark for condensed matter physics comprising over 520 calculation problems, alongside the SEED metric for fine-grained evaluation. This work resides in the 'Graduate-Level Problem-Solving Benchmarks' leaf, which contains only two papers including this one. The sibling paper (CMT Benchmark) similarly targets condensed matter theory evaluation. This represents a relatively sparse research direction within the broader taxonomy, suggesting the specific focus on graduate-level condensed matter calculation problems addresses an underexplored niche.

The taxonomy reveals neighboring work in 'Multidisciplinary Scientific Benchmarks' (four papers) that evaluate LLMs across multiple scientific domains including physics. The parent branch 'Benchmark Development and Evaluation Frameworks' distinguishes domain-specific condensed matter benchmarks from broader scientific assessments. Related directions include 'Quantum Experiment and Simulation Design' and 'Predictive Modeling' under LLM Applications, which focus on applying models rather than systematically evaluating their capabilities. The scope notes clarify that this work's emphasis on structured evaluation separates it from application-driven systems.

Among 26 candidates examined across three contributions, no clearly refuting prior work was identified. The CMPhysBench benchmark contribution was compared against 10 candidates with zero refutations, the SEED metric against 6 candidates with zero refutations, and the empirical evaluation against 10 candidates with zero refutations. Within this limited search scope, the specific combination of graduate-level condensed matter problems, calculation-focused tasks, and tree-based expression evaluation appears relatively novel, though the analysis does not claim exhaustive coverage of all potentially relevant prior work.

Based on the top-26 semantic matches examined, the work appears to occupy a distinct position combining domain specialization (condensed matter physics), task design (open-ended calculations), and evaluation methodology (SEED metric). The sparse population of the taxonomy leaf and absence of refuting candidates within the search scope suggest meaningful differentiation from existing benchmarks, though the limited search scale means potentially relevant work outside these candidates remains unexamined.

Taxonomy

Core-task Taxonomy Papers: 40
Claimed Contributions: 3
Contribution Candidate Papers Compared: 26
Refutable Papers: 0

Research Landscape Overview

Core task: Evaluating large language models in condensed matter physics. The field has organized itself around several complementary directions. Benchmark Development and Evaluation Frameworks focuses on creating rigorous test sets and metrics to assess LLM capabilities on physics problems, ranging from graduate-level problem-solving to specialized domain knowledge. LLM Applications in Materials Discovery and Design explores how models can accelerate the search for novel materials, predict properties, and guide experimental workflows. Domain Enhancement and Adaptation Strategies investigates methods to improve LLM performance through fine-tuning, retrieval augmentation, and physics-informed architectures. Empirical Analysis and Robustness Studies examines model reliability, failure modes, and generalization across different physics subdomains. Methodological Foundations and Cross-Domain Perspectives provides theoretical grounding and connections to broader AI research, while Specialized Physics Applications targets specific phenomena like superconductivity or topological materials. Universal Potentials and Simulation Acceleration develops neural network models for atomic-scale simulations.

Within Benchmark Development, a particularly active line of work centers on graduate-level problem-solving benchmarks that test deep conceptual understanding rather than simple factual recall. CMPhysBench[0] contributes to this effort by providing a curated set of condensed matter physics problems designed to probe reasoning capabilities. This work sits alongside CMT Benchmark[5], which similarly targets condensed matter theory evaluation, and complements broader physics benchmarks like Quantum Mechanics LLMs[3] and QMBench[28] that assess quantum physics understanding.

A key tension across these benchmarks involves balancing problem difficulty with diagnostic value: overly specialized questions may not reveal general reasoning patterns, while simpler tasks risk underestimating model limitations. The emphasis in CMPhysBench[0] on condensed matter specifically allows for deeper probing of domain expertise compared to more general physics assessments, though this specialization also raises questions about how performance translates across subfields.

Claimed Contributions

CMPhysBench: Graduate-level Condensed Matter Physics benchmark with open-ended calculation problems

The authors introduce CMPhysBench, a novel benchmark containing 520 graduate-level questions in Condensed Matter Physics. Unlike existing multiple-choice benchmarks, it focuses exclusively on calculation problems requiring comprehensive solutions, covering six representative topics, including magnetism, superconductivity, and strongly correlated systems, with five distinct answer types; an illustrative item layout is sketched after this block.

10 retrieved papers
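To make the task format concrete, the sketch below shows how a single benchmark item could be represented. The field names, topic label, and example problem are hypothetical illustrations, not the paper's actual schema; only the five answer types and the six-subfield coverage are taken from the description above.

```python
# Hypothetical CMPhysBench item layout. Field names, the topic label, and the
# example problem are illustrative assumptions, not the authors' schema.
example_item = {
    "id": "free-electron-007",                 # hypothetical identifier
    "topic": "metals / free electron theory",  # one of six subfields, e.g. magnetism, superconductivity
    "question": ("Derive the Fermi energy of a three-dimensional free electron "
                 "gas with electron number density n."),
    "answer_type": "expression",               # one of five types: expression, equation, tuple, interval, numeric
    "ground_truth": "hbar**2 * (3*pi**2*n)**(2/3) / (2*m)",
}
```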
Scalable Expression Edit Distance (SEED) metric for fine-grained evaluation

The authors develop SEED, a novel evaluation metric that extends Expression Edit Distance by converting diverse answer types (expressions, equations, tuples, intervals, and numeric values) to abstract syntax trees and computing a tree-edit distance. This provides fine-grained partial-credit scoring rather than binary accuracy, with physics-aware normalization for robust evaluation of mathematical expressions; a minimal sketch of this idea follows below.

6 retrieved papers
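The following minimal Python sketch illustrates the idea behind SEED under stated assumptions: answers are parsed into SymPy expression trees, and a crude tree dissimilarity is mapped to a 0-100 partial-credit score. The function names, the simplified distance routine, and the normalization are assumptions made for illustration; the authors' metric additionally handles equations, tuples, intervals, and numeric answers and applies physics-aware normalization.

```python
# Minimal sketch of a SEED-style partial-credit score using SymPy expression
# trees. Illustrative only; not the authors' implementation.
import sympy as sp
from sympy.parsing.sympy_parser import parse_expr

def to_tree(expr):
    """Convert a SymPy expression into a (label, children) tuple tree."""
    if expr.args:
        return (type(expr).__name__, [to_tree(a) for a in expr.args])
    return (str(expr), [])

def tree_size(t):
    return 1 + sum(tree_size(c) for c in t[1])

def tree_distance(a, b):
    """Crude edit-distance proxy: relabel cost plus pairwise child alignment.
    A real implementation would use a proper tree-edit algorithm (e.g. Zhang-Shasha)."""
    cost = 0 if a[0] == b[0] else 1
    ca, cb = a[1], b[1]
    for x, y in zip(ca, cb):
        cost += tree_distance(x, y)
    # Children without a counterpart count as whole-subtree insertions/deletions.
    for extra in ca[len(cb):] + cb[len(ca):]:
        cost += tree_size(extra)
    return cost

def seed_score(pred: str, gold: str) -> float:
    """Map tree dissimilarity between prediction and ground truth to 0-100."""
    tp = to_tree(sp.simplify(parse_expr(pred)))
    tg = to_tree(sp.simplify(parse_expr(gold)))
    dist = tree_distance(tp, tg)
    return 100.0 * max(0.0, 1.0 - dist / max(tree_size(tp), tree_size(tg)))

print(seed_score("3*k*T/2", "3*k*T/2"))  # 100.0 -- exact match
print(seed_score("k*T", "3*k*T/2"))      # > 0   -- partial credit for a near-miss
```

The second call shows the key property claimed for SEED: a near-miss such as dropping a prefactor still earns graded credit instead of a flat zero.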
Comprehensive empirical evaluation revealing performance gaps in domain-specific physics reasoning

The authors conduct a systematic evaluation of 18 large language models on CMPhysBench, demonstrating that even top-performing models achieve only a SEED score of 36 and 29% accuracy. Their analysis reveals significant capability gaps in condensed matter physics compared to general mathematical reasoning, with detailed error categorization identifying concept misuse and logical errors as the primary failure modes; a toy aggregation of these headline numbers is sketched below.

10 retrieved papers
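As a toy illustration of how the two headline numbers relate, the snippet below aggregates per-item results into an average SEED score and an exact-match accuracy. The data and the aggregation function are invented for illustration, not drawn from the paper.

```python
# Toy aggregation of per-item results into the two headline numbers (average
# SEED and exact-match accuracy). The numbers below are invented examples.
def aggregate(results):
    """results: list of (seed_score_0_to_100, exactly_correct) pairs, one per item."""
    avg_seed = sum(s for s, _ in results) / len(results)
    accuracy = 100.0 * sum(1 for _, ok in results if ok) / len(results)
    return avg_seed, accuracy

toy = [(100.0, True), (40.0, False), (25.0, False), (0.0, False)]
print(aggregate(toy))  # (41.25, 25.0): partial credit exceeds exact-match accuracy
```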

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

CMPhysBench: Graduate-level Condensed Matter Physics benchmark with open-ended calculation problems

The authors introduce CMPhysBench, a novel benchmark containing 520 graduate-level questions in Condensed Matter Physics. Unlike existing multiple-choice benchmarks, it focuses exclusively on calculation problems requiring comprehensive solutions, covering six representative topics, including magnetism, superconductivity, and strongly correlated systems, with five distinct answer types.

Contribution

Scalable Expression Edit Distance (SEED) metric for fine-grained evaluation

The authors develop SEED, a novel evaluation metric that extends Expression Edit Distance by converting diverse answer types (expressions, equations, tuples, intervals, and numeric values) to abstract syntax trees and computing a tree-edit distance. This provides fine-grained partial-credit scoring rather than binary accuracy, with physics-aware normalization for robust evaluation of mathematical expressions.

Contribution

Comprehensive empirical evaluation revealing performance gaps in domain-specific physics reasoning

The authors conduct a systematic evaluation of 18 large language models on CMPhysBench, demonstrating that even top-performing models achieve only a SEED score of 36 and 29% accuracy. Their analysis reveals significant capability gaps in condensed matter physics compared to general mathematical reasoning, with detailed error categorization identifying concept misuse and logical errors as the primary failure modes.