CMT-Benchmark: A Benchmark for Condensed Matter Theory Built by Expert Researchers

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: large language model, statistical mechanics, benchmark, evaluation, numerical methods, scientific problem solving, condensed matter physics, quantum physics
Abstract:

Large language models (LLMs) have demonstrated remarkable progress in coding and mathematical problem-solving, but evaluation on advanced, research-level problems in the hard sciences remains scarce. To fill this gap, we present CMT-Benchmark, a dataset of 50 original problems covering condensed matter theory (CMT) at the level of an expert researcher. The topics span analytical and computational approaches commonly used in quantum many-body physics as well as classical statistical mechanics. The dataset was designed and verified by a panel of expert researchers from around the world. We built it through a collaborative environment that challenges the panel to write and refine difficult problems they would like their research assistants to be able to solve, with topics including Hartree-Fock mean-field theory, exact diagonalization, quantum Monte Carlo, density matrix renormalization group, quantum statistical mechanics, classical statistical mechanics, and model building. We evaluate different LLMs by programmatically checking LLM-generated solutions against expert-supplied ground truth. For this, we developed machine-grading mechanisms suitable for advanced physics research problems; for example, non-commuting operators, which are essential in quantum many-body problems, are handled by symbolic manipulation and normal ordering. Our evaluations show that frontier models struggle with all of the problems in the dataset, highlighting a gap in the physical reasoning skills of current LLMs. Notably, experts identified strategies for creating increasingly difficult problems by interacting with the LLMs and exploiting common failure modes. While the highest-performing model, GPT5, correctly solves 30% of the problems, average performance across 17 models (GPT, Gemini, Claude, DeepSeek, and Llama classes) is only 11.4 ± 2.1%. Moreover, our benchmark contains 18 problems that not a single one of the 17 models can solve correctly, and 26 problems that are solved by at most one model. These currently unsolvable problems span quantum Monte Carlo, variational Monte Carlo, and density matrix renormalization group methods, and model answers sometimes violate fundamental symmetries or have unphysical scaling dimensions. We believe this benchmark provides valuable guidance for the future development of language models toward the goal of AI research assistants and tutors.
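To make the flavor of these tasks concrete, here is a minimal, hypothetical sketch (not an item from the benchmark) of one problem class the abstract names, exact diagonalization, applied to a small spin-1/2 Heisenberg chain; all parameters are illustrative.

```python
import numpy as np

# Hypothetical sketch of one problem class named in the abstract: exact
# diagonalization of a small spin-1/2 Heisenberg chain,
# H = J * sum_i S_i . S_{i+1}, with open boundary conditions.
sx = np.array([[0, 1], [1, 0]], dtype=complex) / 2
sy = np.array([[0, -1j], [1j, 0]]) / 2
sz = np.array([[1, 0], [0, -1]], dtype=complex) / 2
I2 = np.eye(2)

def site_op(op, i, L):
    """Embed a single-site operator at site i of an L-site chain via Kronecker products."""
    out = np.array([[1.0]])
    for j in range(L):
        out = np.kron(out, op if j == i else I2)
    return out

L, J = 4, 1.0
H = np.zeros((2**L, 2**L), dtype=complex)
for i in range(L - 1):  # open boundary conditions: bonds (0,1), (1,2), (2,3)
    for s in (sx, sy, sz):
        H += J * site_op(s, i, L) @ site_op(s, i + 1, L)

# Ground-state energy of the L=4 open chain is about -1.616 (in units of J).
print(np.linalg.eigvalsh(H).min())
```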

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes a paper's tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND ITS JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces CMT-Benchmark, a dataset of 50 expert-curated problems in condensed matter theory targeting research-level competence. It resides in the 'Condensed Matter Physics Benchmarks' leaf alongside three sibling papers (CMPhysBench, QMBench, and one other work). This leaf is part of a moderately populated branch on domain-specific physics and materials science benchmarks, indicating a growing but not yet saturated research direction focused on evaluating LLMs in specialized physics subfields.

The taxonomy reveals neighboring leaves addressing materials science knowledge benchmarks, quantum mechanics evaluations, and frontier physics research tasks. CMT-Benchmark's focus on condensed matter theory—covering Hartree-Fock, exact diagonalization, quantum Monte Carlo, and DMRG—positions it at the intersection of quantum many-body physics and computational methods. This distinguishes it from broader materials informatics benchmarks and general quantum mechanics evaluations, carving out a niche for deep theoretical physics assessment rather than property prediction or multi-disciplinary breadth.

Among the 30 candidates examined, none clearly refutes the three core contributions: the expert-curated dataset (10 candidates, 0 refutable), the machine-grading framework for non-commuting operators (10 candidates, 0 refutable), and the evaluation revealing LLM reasoning gaps (10 candidates, 0 refutable). Within the top semantic matches examined, no prior work directly overlaps with the combination of research-level condensed matter problems, collaborative expert curation, and automated grading for advanced quantum operators. Each contribution therefore appears relatively novel within the examined literature.

Based on the top-30 semantic matches and taxonomy structure, the work occupies a sparsely populated niche within domain-specific physics benchmarks. The absence of refutable candidates across all contributions, combined with the small sibling set in the taxonomy leaf, suggests the paper addresses a gap in research-level condensed matter evaluation. However, the limited search scope means broader or less semantically similar prior work may exist outside the examined set.

Taxonomy

Core-task Taxonomy Papers: 34
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 0

Research Landscape Overview

Core task: Evaluating large language models on research-level condensed matter theory problems. The field has organized itself around several complementary directions. Domain-specific benchmarks for physics and materials science form a dense branch, with works like CMT Benchmark[0], CMPhysBench[9], and QMBench[25] targeting condensed matter phenomena, quantum mechanics, and materials properties through curated problem sets. Multi-disciplinary and broad-scope evaluation benchmarks such as Supergpqa[2] and R-bench[10] assess reasoning across wider scientific domains, while domain-specific LLM enhancement and application systems, including LLaMP[6], Matterchat[7], and Agentic Physics Exploration[8], build specialized tools and agents for materials discovery and physics exploration. A smaller cluster focuses on LLM-assisted scientific data integration and retrieval, leveraging knowledge graphs and literature mining to support discovery workflows. Theoretical frameworks for LLM structure and reasoning, alongside branches on condensed matter physics theory itself and biographical accounts, round out the taxonomy by addressing foundational questions and historical context.

Particularly active lines of work contrast narrow, expert-level evaluation with broader scientific reasoning. Benchmarks like CMT Benchmark[0] and QMBench[25] emphasize deep domain expertise in condensed matter and quantum many-body physics, probing whether models can handle graduate-level derivations and conceptual subtleties. In contrast, works such as Supergpqa[2] and CURIE[3] explore multi-domain scientific problem-solving, trading depth for breadth.

CMT Benchmark[0] sits squarely within the condensed matter physics benchmarks cluster, sharing its focus on research-level theory with CMPhysBench[9] and QMBench[25], yet it distinguishes itself by targeting the specific challenges of condensed matter theory rather than broader quantum mechanics or materials informatics. This positioning highlights an open question: whether specialized benchmarks better reveal model limitations in expert domains than do general-purpose scientific evaluations.

Claimed Contributions

CMT-Benchmark: Expert-Curated Research-Level Dataset for Condensed Matter Theory

The authors introduce CMT-Benchmark, a dataset of 50 original problems in condensed matter theory designed and verified by an international panel of expert researchers. The problems cover analytical and computational methods at research level, including Hartree-Fock theory, exact diagonalization, quantum Monte Carlo, DMRG, and statistical mechanics.

10 retrieved papers
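As a hedged illustration of the mean-field problem class in this contribution (again, not an actual benchmark item), the sketch below solves the textbook Ising mean-field self-consistency equation by fixed-point iteration; the function name and defaults are invented.

```python
import math

# Hypothetical sketch of the mean-field problem class listed above: solve the
# Ising mean-field self-consistency equation m = tanh(beta * J * z * m) by
# fixed-point iteration (z = coordination number; units with k_B = 1).
def mean_field_magnetization(T, J=1.0, z=4, tol=1e-10, max_iter=10_000):
    beta, m = 1.0 / T, 0.5  # start from a symmetry-broken guess
    for _ in range(max_iter):
        m_new = math.tanh(beta * J * z * m)
        if abs(m_new - m) < tol:
            return m_new
        m = m_new
    return m

print(mean_field_magnetization(T=2.0))  # ordered phase: m > 0 (T below T_c = z*J = 4)
print(mean_field_magnetization(T=5.0))  # disordered phase: m -> 0
```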
Machine-Grading Framework for Advanced Physics Problems with Non-Commuting Operators

The authors developed automated evaluation mechanisms capable of grading advanced physics problems, including a novel parser that handles non-commutative operator algebra through symbolic manipulation and normal ordering. This enables deterministic, objective grading of quantum many-body physics problems.

10 retrieved papers
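A minimal sketch of the kind of normal-ordering check this contribution describes is given below. The paper's actual parser and operator classes are not reproduced here; the sketch substitutes SymPy noncommutative symbols with the bosonic commutator [a, a†] = 1, and every name in it is an assumption.

```python
import sympy as sp

# Hedged sketch of a normal-ordering equivalence check, using SymPy's
# noncommutative symbols and the bosonic rule a * a_dag = a_dag * a + 1
# as a stand-in for the framework's operator algebra (all names invented).
a, ad = sp.symbols('a a_dag', commutative=False)

def normal_order(expr):
    """Rewrite a*a_dag -> a_dag*a + 1 until all creation operators stand to the left."""
    expr = sp.expand(expr)
    while True:
        new = sp.expand(expr.subs(a * ad, ad * a + 1))
        if new == expr:
            return new
        expr = new

def equivalent(expr1, expr2):
    """Grade two operator expressions by comparing normal-ordered canonical forms."""
    return sp.expand(normal_order(expr1) - normal_order(expr2)) == 0

# a a_dag a = a_dag a a + a once normal ordered, so these two answers match:
print(equivalent(a * ad * a, ad * a * a + a))  # True
print(equivalent(a * ad, ad * a))              # False: they differ by the commutator
```

Comparing canonical forms this way makes grading deterministic: two superficially different but physically equal operator expressions reduce to the same normal-ordered polynomial.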
Rigorous Evaluation Revealing Fundamental Gaps in LLM Scientific Reasoning

The authors conducted rigorous evaluations of 17 frontier LLMs, revealing that even the best model (GPT5) achieves only 30% accuracy, with 18 problems unsolved by any model. The evaluation exposes fundamental gaps in LLM reasoning, including violations of physical symmetries and unphysical scaling dimensions.

10 retrieved papers
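The snippet below is a schematic of how such programmatic checking could look; the benchmark's real harness, problem IDs, ground truths, and tolerances are not public in this report, so everything here is invented for illustration.

```python
import math

# Schematic of the programmatic checking described above; problem IDs,
# ground truths, and tolerances are invented for illustration.
PROBLEMS = {
    "ed_heisenberg_L4": {"truth": -1.6160254, "tol": 1e-4},  # ED ground-state energy
    "ising2d_Tc":       {"truth": 2.2691853,  "tol": 1e-4},  # 2 / ln(1 + sqrt(2))
}

def grade(problem_id: str, model_answer: float) -> bool:
    """Mark a parsed numeric answer correct if it lies within the problem's tolerance."""
    spec = PROBLEMS[problem_id]
    return math.isclose(model_answer, spec["truth"], rel_tol=0.0, abs_tol=spec["tol"])

print(grade("ed_heisenberg_L4", -1.61603))  # True
print(grade("ising2d_Tc", 2.30))            # False
```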

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution 1: CMT-Benchmark: Expert-Curated Research-Level Dataset for Condensed Matter Theory

Contribution 2: Machine-Grading Framework for Advanced Physics Problems with Non-Commuting Operators

Contribution 3: Rigorous Evaluation Revealing Fundamental Gaps in LLM Scientific Reasoning

(Descriptions for each contribution are given in the Claimed Contributions section above.)