CMT-Benchmark: A Benchmark for Condensed Matter Theory Built by Expert Researchers
Overview
Overall Novelty Assessment
The paper introduces CMT-Benchmark, a dataset of 50 expert-curated problems in condensed matter theory targeting research-level competence. It resides in the 'Condensed Matter Physics Benchmarks' leaf alongside three sibling papers (CMPhysBench, QMBench, and a condensed matter property prediction study). This leaf is part of a moderately populated branch on domain-specific physics and materials science benchmarks, indicating a growing but not yet saturated research direction focused on evaluating LLMs in specialized physics subfields.
The taxonomy reveals neighboring leaves addressing materials science knowledge benchmarks, quantum mechanics evaluations, and frontier physics research tasks. CMT-Benchmark's focus on condensed matter theory—covering Hartree-Fock, exact diagonalization, quantum Monte Carlo, and DMRG—positions it at the intersection of quantum many-body physics and computational methods. This distinguishes it from broader materials informatics benchmarks and general quantum mechanics evaluations, carving out a niche for deep theoretical physics assessment rather than property prediction or multi-disciplinary breadth.
Among 30 candidates examined, none clearly refute the three core contributions: the expert-curated dataset (10 candidates, 0 refutable), the machine-grading framework for non-commuting operators (10 candidates, 0 refutable), and the evaluation revealing LLM reasoning gaps (10 candidates, 0 refutable). The limited search scope suggests that within the top semantic matches, no prior work directly overlaps with the combination of research-level condensed matter problems, collaborative expert curation, and automated grading for advanced quantum operators. Each contribution appears relatively novel given the examined literature.
Based on the top-30 semantic matches and taxonomy structure, the work occupies a sparsely populated niche within domain-specific physics benchmarks. The absence of refutable candidates across all contributions, combined with the small sibling set in the taxonomy leaf, suggests the paper addresses a gap in research-level condensed matter evaluation. However, the limited search scope means broader or less semantically similar prior work may exist outside the examined set.
Taxonomy
Research Landscape Overview
Claimed Contributions
The authors introduce CMT-Benchmark, a dataset of 50 original problems in condensed matter theory designed and verified by an international panel of expert researchers. The problems cover analytical and computational methods at research level, including Hartree-Fock theory, exact diagonalization, quantum Monte Carlo, DMRG, and statistical mechanics.
The authors developed automated evaluation mechanisms capable of grading advanced physics problems, including a novel parser that handles non-commutative operator algebra through symbolic manipulation and normal ordering. This enables deterministic, objective grading of quantum many-body physics problems.
The authors conducted rigorous evaluations of 17 frontier LLMs, revealing that even the best model (GPT-5) achieves only 30% accuracy, with 18 problems unsolved by any model. The evaluation exposes fundamental gaps in LLM reasoning, including violations of physical symmetries and unphysical scaling dimensions.
Core Task Comparisons
Comparisons with papers in the same taxonomy category
[9] CMPhysBench: A Benchmark for Evaluating Large Language Models in Condensed Matter Physics
[25] QMBench: A Research Level Benchmark for Quantum Materials Research
[27] Research on Condensed Matter Property Prediction Based on Large Language Model
Contribution Analysis
Detailed comparisons for each claimed contribution
CMT-Benchmark: Expert-Curated Research-Level Dataset for Condensed Matter Theory
The authors introduce CMT-Benchmark, a dataset of 50 original problems in condensed matter theory designed and verified by an international panel of expert researchers. The problems cover analytical and computational methods at research level, including Hartree-Fock theory, exact diagonalization, quantum Monte Carlo, DMRG, and statistical mechanics.
[3] CURIE: Evaluating LLMs On Multitask Scientific Long Context Understanding and Reasoning
[9] CMPhysBench: A Benchmark for Evaluating Large Language Models in Condensed Matter Physics
[16] Topomas: Large language model driven topological materials multiagent system
[44] Materials expert-artificial intelligence for materials discovery
[45] A database of experimentally measured lithium solid electrolyte conductivities evaluated with machine learning
[46] Literature classification and its applications in condensed matter physics and materials science by natural language processing
[47] Deep learning for symmetry classification using sparse 3D electron density data for inorganic compounds
[48] Towards accurate prediction of configurational disorder properties in materials using graph neural networks
[49] How to verify the precision of density-functional-theory implementations via reproducible and universal workflows
[50] Quantum Algorithm Software for Condensed Matter Physics
Machine-Grading Framework for Advanced Physics Problems with Non-Commuting Operators
The authors developed automated evaluation mechanisms capable of grading advanced physics problems, including a novel parser that handles non-commutative operator algebra through symbolic manipulation and normal ordering. This enables deterministic, objective grading of quantum many-body physics problems.
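A grader of this kind must decide whether two operator expressions are physically equal even when they are written in different orders, since creation and annihilation operators do not commute. The following is a minimal sketch of the underlying idea only, using SymPy's bosonic operators rather than the authors' actual parser: reduce both expressions to normal-ordered form (creation operators to the left) and check that their difference vanishes.

```python
from sympy import expand
from sympy.physics.quantum import Dagger
from sympy.physics.quantum.boson import BosonOp
from sympy.physics.quantum.operatorordering import normal_ordered_form

a = BosonOp("a")  # bosonic annihilation operator, with [a, a†] = 1

# Two textually different but physically identical expressions:
candidate = a * Dagger(a)        # a a†
reference = Dagger(a) * a + 1    # a† a + 1  (by the commutation relation)

# Normal ordering moves every a† to the left, inserting the commutator
# terms, so equivalent expressions reduce to the same canonical form
# and their difference collapses to zero.
diff = normal_ordered_form(expand(candidate - reference))
print(diff)  # → 0
```

A deterministic grader can then mark an answer correct exactly when this canonical difference is zero, with no LLM judge in the loop.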
[1] Evaluating Large Language Models on Quantum Mechanics: A Comparative Study Across Diverse Models and Tasks
[35] Symbolic machine learning for high energy physics calculations
[36] Explorations in Computational Physics
[37] Physics-informed neural networks for PDE problems: A comprehensive review
[38] Large physics models: towards a collaborative approach with large language models and foundation models
[39] Reinforcement Learning with Physics-Informed Symbolic Program Priors for Zero-Shot Wireless Indoor Navigation
[40] Deep symbolic regression for physics guided by units constraints: toward the automated discovery of physical laws
[41] PISR: Physics-Informed Symbolic Regression for Predicting Power System Voltage
[42] DeepONet as a Multi-Operator Extrapolation Model: Distributed Pretraining with Physics-Informed Fine-Tuning
[43] A Mathematical Model for Representing the Related Operator Professional Activities and Their Diagnostic Assessments Based on the Quantum Representations
Rigorous Evaluation Revealing Fundamental Gaps in LLM Scientific Reasoning
The authors conducted rigorous evaluations of 17 frontier LLMs, revealing that even the best model (GPT-5) achieves only 30% accuracy, with 18 problems unsolved by any model. The evaluation exposes fundamental gaps in LLM reasoning, including violations of physical symmetries and unphysical scaling dimensions.
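The headline statistics of such an evaluation (best-model accuracy and the count of problems no model solves) are simple aggregates over a per-model, per-problem correctness grid. A minimal sketch with a hypothetical two-model, four-problem grid; the model names and values are illustrative only, not the paper's data:

```python
# results[model][i] is True iff the model's answer to problem i was
# graded correct. Values here are made up for illustration.
results = {
    "model_a": [True, False, False, False],
    "model_b": [False, False, True, False],
}
n_problems = 4

# Best single-model accuracy across the benchmark.
best_acc = max(sum(row) / n_problems for row in results.values())

# Problems that no model solved (the paper reports 18 such problems).
unsolved = sum(
    not any(results[m][i] for m in results) for i in range(n_problems)
)
print(best_acc, unsolved)  # → 0.25 2
```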