CMT-Benchmark: A Benchmark for Condensed Matter Theory Built by Expert Researchers

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: large language model, statistical mechanics, benchmark, evaluation, numerical methods, scientific problem solving, condensed matter physics, quantum physics
Abstract:

Large language models (LLMs) have demonstrated remarkable progress in coding and mathematical problem-solving, but evaluation on advanced, research-level problems in the hard sciences remains scarce. To fill this gap, we present CMT-Benchmark, a dataset of 50 original problems covering condensed matter theory (CMT) at the level of an expert researcher. The topics span analytical and computational approaches commonly used in quantum many-body physics as well as classical statistical mechanics. The dataset was designed and verified by a panel of expert researchers from around the world. We built it through a collaborative environment that challenges the panel to write and refine difficult problems they would like their research assistants to be able to solve, with topics including Hartree-Fock mean-field theory, exact diagonalization, quantum Monte Carlo, density matrix renormalization group, quantum statistical mechanics, classical statistical mechanics, and model building. We evaluate different LLMs by programmatically checking LLM-generated solutions against expert-supplied ground truth. For this, we developed machine-grading mechanisms suitable for advanced physics research problems; for example, non-commuting operators, which are essential in quantum many-body problems, are handled by symbolic manipulation and normal ordering. Our evaluations show that frontier models struggle with all of the problems in the dataset, highlighting a gap in the physical reasoning skills of current LLMs. Notably, experts identified strategies for creating increasingly difficult problems by interacting with the LLMs and exploiting common failure modes. While the highest-performing model, GPT5, correctly solves 30% of the problems, average performance across 17 models (GPT, Gemini, Claude, DeepSeek, and Llama classes) is only 11.4 ± 2.1%. Moreover, our benchmark contains 18 problems that not a single one of the 17 models can solve correctly, and 26 problems that are solved by at most one model. These currently unsolvable problems span quantum Monte Carlo, variational Monte Carlo, and density matrix renormalization group methods, and model answers sometimes violate fundamental symmetries or have unphysical scaling dimensions. We believe this benchmark provides valuable guidance for the future development of language models toward the goal of AI research assistants and tutors.
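To make the flavor of these tasks concrete, here is a minimal, hypothetical sketch (not an item from the benchmark) of one problem class the abstract names, exact diagonalization, applied to a small spin-1/2 Heisenberg chain; all parameters are illustrative.

```python
import numpy as np

# Hypothetical sketch of one problem class named in the abstract: exact
# diagonalization of a small spin-1/2 Heisenberg chain,
# H = J * sum_i S_i . S_{i+1}, with open boundary conditions.
sx = np.array([[0, 1], [1, 0]], dtype=complex) / 2
sy = np.array([[0, -1j], [1j, 0]]) / 2
sz = np.array([[1, 0], [0, -1]], dtype=complex) / 2
I2 = np.eye(2)

def site_op(op, i, L):
    """Embed a single-site operator at site i of an L-site chain via Kronecker products."""
    out = np.array([[1.0]])
    for j in range(L):
        out = np.kron(out, op if j == i else I2)
    return out

L, J = 4, 1.0
H = np.zeros((2**L, 2**L), dtype=complex)
for i in range(L - 1):  # open boundary conditions: bonds (0,1), (1,2), (2,3)
    for s in (sx, sy, sz):
        H += J * site_op(s, i, L) @ site_op(s, i + 1, L)

# Ground-state energy of the L=4 open chain is about -1.616 (in units of J).
print(np.linalg.eigvalsh(H).min())
```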

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes a paper's tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND ITS JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces CMT-Benchmark, a dataset of 50 expert-curated problems in condensed matter theory targeting research-level competence. It resides in the 'Condensed Matter Physics Benchmarks' leaf alongside three sibling papers (CMPhysBench, QMBench, and one other work). This leaf is part of a moderately populated branch on domain-specific physics and materials science benchmarks, indicating a growing but not yet saturated research direction focused on evaluating LLMs in specialized physics subfields.

The taxonomy reveals neighboring leaves addressing materials science knowledge benchmarks, quantum mechanics evaluations, and frontier physics research tasks. CMT-Benchmark's focus on condensed matter theory—covering Hartree-Fock, exact diagonalization, quantum Monte Carlo, and DMRG—positions it at the intersection of quantum many-body physics and computational methods. This distinguishes it from broader materials informatics benchmarks and general quantum mechanics evaluations, carving out a niche for deep theoretical physics assessment rather than property prediction or multi-disciplinary breadth.

Among the 30 candidates examined, none clearly refutes the three core contributions: the expert-curated dataset (10 candidates, 0 refutable), the machine-grading framework for non-commuting operators (10 candidates, 0 refutable), and the evaluation revealing LLM reasoning gaps (10 candidates, 0 refutable). Within the top semantic matches examined, no prior work directly overlaps with the combination of research-level condensed matter problems, collaborative expert curation, and automated grading for advanced quantum operators. Each contribution therefore appears relatively novel within the examined literature.

Based on the top-30 semantic matches and taxonomy structure, the work occupies a sparsely populated niche within domain-specific physics benchmarks. The absence of refutable candidates across all contributions, combined with the small sibling set in the taxonomy leaf, suggests the paper addresses a gap in research-level condensed matter evaluation. However, the limited search scope means broader or less semantically similar prior work may exist outside the examined set.

Taxonomy

Core-task Taxonomy Papers: 34
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 0

Research Landscape Overview

Core task: Evaluating large language models on research-level condensed matter theory problems. The field has organized itself around several complementary directions. Domain-specific benchmarks for physics and materials science form a dense branch, with works like CMT Benchmark[0], CMPhysBench[9], and QMBench[25] targeting condensed matter phenomena, quantum mechanics, and materials properties through curated problem sets. Multi-disciplinary and broad-scope evaluation benchmarks such as Supergpqa[2] and R-bench[10] assess reasoning across wider scientific domains, while domain-specific LLM enhancement and application systems, including LLaMP[6], Matterchat[7], and Agentic Physics Exploration[8], build specialized tools and agents for materials discovery and physics exploration. A smaller cluster focuses on LLM-assisted scientific data integration and retrieval, leveraging knowledge graphs and literature mining to support discovery workflows. Theoretical frameworks for LLM structure and reasoning, alongside branches on condensed matter physics theory itself and biographical accounts, round out the taxonomy by addressing foundational questions and historical context.

Particularly active lines of work contrast narrow, expert-level evaluation with broader scientific reasoning. Benchmarks like CMT Benchmark[0] and QMBench[25] emphasize deep domain expertise in condensed matter and quantum many-body physics, probing whether models can handle graduate-level derivations and conceptual subtleties. In contrast, works such as Supergpqa[2] and CURIE[3] explore multi-domain scientific problem-solving, trading depth for breadth.

CMT Benchmark[0] sits squarely within the condensed matter physics benchmarks cluster, sharing its focus on research-level theory with CMPhysBench[9] and QMBench[25], yet it distinguishes itself by targeting the specific challenges of condensed matter theory rather than broader quantum mechanics or materials informatics. This positioning highlights an open question: whether specialized benchmarks better reveal model limitations in expert domains than do general-purpose scientific evaluations.

Claimed Contributions

CMT-Benchmark: Expert-Curated Research-Level Dataset for Condensed Matter Theory

The authors introduce CMT-Benchmark, a dataset of 50 original problems in condensed matter theory designed and verified by an international panel of expert researchers. The problems cover analytical and computational methods at research level, including Hartree-Fock theory, exact diagonalization, quantum Monte Carlo, DMRG, and statistical mechanics.

10 retrieved papers
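As a hedged illustration of the mean-field problem class in this contribution (again, not an actual benchmark item), the sketch below solves the textbook Ising mean-field self-consistency equation by fixed-point iteration; the function name and defaults are invented.

```python
import math

# Hypothetical sketch of the mean-field problem class listed above: solve the
# Ising mean-field self-consistency equation m = tanh(beta * J * z * m) by
# fixed-point iteration (z = coordination number; units with k_B = 1).
def mean_field_magnetization(T, J=1.0, z=4, tol=1e-10, max_iter=10_000):
    beta, m = 1.0 / T, 0.5  # start from a symmetry-broken guess
    for _ in range(max_iter):
        m_new = math.tanh(beta * J * z * m)
        if abs(m_new - m) < tol:
            return m_new
        m = m_new
    return m

print(mean_field_magnetization(T=2.0))  # ordered phase: m > 0 (T below T_c = z*J = 4)
print(mean_field_magnetization(T=5.0))  # disordered phase: m -> 0
```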
Machine-Grading Framework for Advanced Physics Problems with Non-Commuting Operators

The authors developed automated evaluation mechanisms capable of grading advanced physics problems, including a novel parser that handles non-commutative operator algebra through symbolic manipulation and normal ordering. This enables deterministic, objective grading of quantum many-body physics problems.

10 retrieved papers
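A minimal sketch of the kind of normal-ordering check this contribution describes is given below. The paper's actual parser and operator classes are not reproduced here; the sketch substitutes SymPy noncommutative symbols with the bosonic commutator [a, a†] = 1, and every name in it is an assumption.

```python
import sympy as sp

# Hedged sketch of a normal-ordering equivalence check, using SymPy's
# noncommutative symbols and the bosonic rule a * a_dag = a_dag * a + 1
# as a stand-in for the framework's operator algebra (all names invented).
a, ad = sp.symbols('a a_dag', commutative=False)

def normal_order(expr):
    """Rewrite a*a_dag -> a_dag*a + 1 until all creation operators stand to the left."""
    expr = sp.expand(expr)
    while True:
        new = sp.expand(expr.subs(a * ad, ad * a + 1))
        if new == expr:
            return new
        expr = new

def equivalent(expr1, expr2):
    """Grade two operator expressions by comparing normal-ordered canonical forms."""
    return sp.expand(normal_order(expr1) - normal_order(expr2)) == 0

# a a_dag a = a_dag a a + a once normal ordered, so these two answers match:
print(equivalent(a * ad * a, ad * a * a + a))  # True
print(equivalent(a * ad, ad * a))              # False: they differ by the commutator
```

Comparing canonical forms this way makes grading deterministic: two superficially different but physically equal operator expressions reduce to the same normal-ordered polynomial.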
Rigorous Evaluation Revealing Fundamental Gaps in LLM Scientific Reasoning

The authors conducted rigorous evaluations of 17 frontier LLMs, revealing that even the best model (GPT5) achieves only 30% accuracy, with 18 problems unsolved by any model. The evaluation exposes fundamental gaps in LLM reasoning, including violations of physical symmetries and unphysical scaling dimensions.

10 retrieved papers
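The snippet below is a schematic of how such programmatic checking could look; the benchmark's real harness, problem IDs, ground truths, and tolerances are not public in this report, so everything here is invented for illustration.

```python
import math

# Schematic of the programmatic checking described above; problem IDs,
# ground truths, and tolerances are invented for illustration.
PROBLEMS = {
    "ed_heisenberg_L4": {"truth": -1.6160254, "tol": 1e-4},  # ED ground-state energy
    "ising2d_Tc":       {"truth": 2.2691853,  "tol": 1e-4},  # 2 / ln(1 + sqrt(2))
}

def grade(problem_id: str, model_answer: float) -> bool:
    """Mark a parsed numeric answer correct if it lies within the problem's tolerance."""
    spec = PROBLEMS[problem_id]
    return math.isclose(model_answer, spec["truth"], rel_tol=0.0, abs_tol=spec["tol"])

print(grade("ed_heisenberg_L4", -1.61603))  # True
print(grade("ising2d_Tc", 2.30))            # False
```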

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution 1: CMT-Benchmark: Expert-Curated Research-Level Dataset for Condensed Matter Theory

Contribution 2: Machine-Grading Framework for Advanced Physics Problems with Non-Commuting Operators

Contribution 3: Rigorous Evaluation Revealing Fundamental Gaps in LLM Scientific Reasoning

(Descriptions for each contribution are given in the Claimed Contributions section above.)