CatalystBench: A Comprehensive Multi-Task Benchmark for Advancing Language Models in Catalysis Science

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: Scientific Benchmark, AI for Science, Catalyst Design, Large Language Models, Multi-task Learning, Domain Adaptation
Abstract:

The discovery of novel catalytic materials is a cornerstone of chemical engineering and sustainable energy, yet it remains a complex, knowledge-intensive process. While Large Language Models (LLMs) have demonstrated remarkable potential across scientific domains, their application to catalysis is hindered by the lack of specialized, multi-dimensional benchmarks to guide their development and evaluation. To bridge this critical gap, we introduce CatalystBench, a comprehensive and challenging benchmark meticulously constructed from scientific literature and public datasets, specifically designed to assess the capabilities of LLMs in the nuanced domain of catalyst design. The benchmark's tasks span the entire closed-loop process of catalyst development, including reading comprehension, experimental analysis, and scheme reasoning. Building on this benchmark, we propose a Multi-head Full-task (MFT) domain-specific fine-tuning method that couples task-specific output heads to a shared backbone. We systematically compare it with three other fine-tuning strategies: Single-Task (ST), Full-Task (FT), and Multi-head Single-Task (MST). Extensive experiments demonstrate that the MFT strategy consistently achieves the most substantial performance improvements across all tasks, underscoring the effectiveness of explicit multi-task architectures in complex scientific reasoning. The resulting CatalystLLM significantly outperforms a wide array of state-of-the-art open-source and closed-source models on CatalystBench. We will publicly release both the CatalystBench benchmark and the CatalystLLM model, providing the community with a robust evaluation framework and a powerful new tool to accelerate AI-driven research in catalytic materials science.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces CatalystBench, a multi-task benchmark covering reading comprehension, experimental analysis, and scheme reasoning across the catalyst development lifecycle, alongside a Multi-head Full-task fine-tuning method. It resides in the Multi-Task Catalysis Benchmarks leaf, which contains only two papers including this one. This represents a sparse research direction within the broader taxonomy of 31 papers across 16 leaf nodes, suggesting the work addresses an emerging rather than saturated area of inquiry.

The taxonomy reveals that benchmark development is one of five major branches, with neighboring leaves focusing on Domain-Specific Chemistry Benchmarks and Materials Synthesis and Discovery Benchmarks. The scope note for Multi-Task Catalysis Benchmarks explicitly excludes single-task or chemistry-general evaluations, positioning this work as distinct from broader chemistry foundation models and specialized prediction tasks. The sibling paper in this leaf appears to share the multi-task catalysis focus, indicating a nascent but coherent research thread within the field.

Among 30 candidates examined, none clearly refute any of the three contributions: the benchmark itself, the MFT fine-tuning strategy, or the CatalystLLM model. Each contribution was assessed against 10 candidates with zero refutable overlaps identified. This suggests that within the limited search scope, the combination of a comprehensive multi-task catalysis benchmark and the proposed fine-tuning architecture does not have direct precedents in the examined literature, though the search scale precludes exhaustive claims about absolute novelty.

Based on the top-30 semantic matches and taxonomy structure, the work appears to occupy a relatively unexplored niche at the intersection of multi-task benchmarking and catalysis-specific language modeling. The sparse population of the Multi-Task Catalysis Benchmarks leaf and absence of refuting candidates within the examined scope suggest meaningful differentiation from existing efforts, though broader literature beyond the search scope may contain relevant prior work not captured here.

Taxonomy

Core-task Taxonomy Papers: 31
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 0

Research Landscape Overview

Core task: Advancing language models in catalysis science through multi-task benchmarking.

The field structure reflects a maturing effort to integrate modern language models into catalysis research, organized around five main branches. Benchmark Development and Evaluation Frameworks focuses on creating standardized test suites and metrics to assess model performance across diverse catalysis tasks, exemplified by works like CatalystBench[0] and CITE[26]. Specialized Prediction Tasks Using Pre-Trained Models adapts existing language architectures to specific catalysis challenges such as reaction prediction, property estimation, and catalyst screening, with contributions including MPEK Enzymatic[1] and ChemReasoner[8]. Application-Driven LLM Deployment in Catalysis emphasizes end-to-end systems for literature mining, experimental design, and discovery workflows, as seen in Catalysis Literature Distillation[13] and Automated Materials Discovery[12]. Multimodal and Spatiotemporal LLM Architectures explores models that integrate molecular images, temporal dynamics, and spatial information, while Supporting Methodologies and Tools provides foundational techniques for data curation, domain adaptation, and risk assessment.

A particularly active line of work centers on multi-task benchmarking, where researchers seek to evaluate language models across heterogeneous catalysis problems rather than isolated prediction targets. CatalystBench[0] sits squarely within this effort, proposing a comprehensive evaluation framework that spans multiple catalysis subtasks. This contrasts with more narrowly scoped benchmarks like CITE[26], which focuses on specific catalyst information extraction, and with application-driven systems such as Catalysis Literature Distillation[13] that prioritize deployment over systematic evaluation.
The main trade-off across these directions involves breadth versus depth: broad multi-task benchmarks offer holistic assessments but may sacrifice task-specific optimization, while specialized prediction models and targeted applications achieve higher performance on individual problems at the cost of generalizability. Open questions include how to balance task diversity with meaningful performance metrics and whether unified benchmarks can drive progress across the fragmented landscape of catalysis subdomains.

Claimed Contributions

CatalystBench: A comprehensive multi-task benchmark for catalysis science

The authors construct CatalystBench, a novel benchmark dataset that combines high-fidelity theoretical datasets from DFT calculations with curated experimental literature. It covers eight diverse tasks spanning the entire catalyst development lifecycle, including reading comprehension, experimental analysis, and scheme reasoning, formatted as structured Q&A pairs.

10 retrieved papers
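
The structured Q&A pairs described above can be pictured as records tagged with a task type and an answer modality. The sketch below is purely illustrative: the field names, task label, and numeric value are assumptions, since the paper excerpt does not specify the exact schema.

```python
# Hypothetical CatalystBench-style Q&A record. Field names and values are
# illustrative assumptions, not the paper's actual schema.
qa_pair = {
    "task": "experimental_analysis",   # one of the eight benchmark tasks
    "source": "DFT",                   # theoretical (DFT) vs. literature-derived
    "question": (
        "Given a Pt(111) surface, what is the predicted CO adsorption "
        "energy in eV?"
    ),
    "answer": "-1.72",                 # illustrative value only
    "answer_type": "regression",       # regression / classification / free-text
}

# A record's answer type determines which output head (see the MFT
# strategy below) would score it during fine-tuning.
assert qa_pair["answer_type"] in {"regression", "classification", "free-text"}
```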
Multi-head Full-task (MFT) fine-tuning strategy

The authors propose MFT, a fine-tuning method that employs task-specific output heads (classification, regression, and language modeling heads) trained in parallel on a shared backbone. This architectural decoupling prevents interference between qualitatively different objectives while enabling cross-task knowledge transfer in catalyst design workflows.

10 retrieved papers
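
The head-per-objective design described above can be sketched in a few lines of PyTorch. This is a minimal stand-in, not the paper's implementation: the tiny encoder replaces the pretrained LLM backbone (e.g., ChemLLM-7B), and the pooling and dimensions are assumptions.

```python
import torch
import torch.nn as nn

class MultiHeadCatalystModel(nn.Module):
    """Sketch of an MFT-style model: one shared backbone feeds three
    decoupled task-specific heads. The backbone here is a small stand-in
    encoder; the paper fine-tunes a pretrained LLM instead."""

    def __init__(self, vocab_size=32000, hidden=256, n_classes=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=4,
                                           batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)
        # One output head per objective type, all sharing the backbone.
        self.cls_head = nn.Linear(hidden, n_classes)   # classification tasks
        self.reg_head = nn.Linear(hidden, 1)           # regression tasks
        self.lm_head = nn.Linear(hidden, vocab_size)   # language modeling

    def forward(self, input_ids, task):
        h = self.backbone(self.embed(input_ids))       # (B, T, H)
        if task == "classification":
            return self.cls_head(h[:, 0])              # pool first token
        if task == "regression":
            return self.reg_head(h[:, 0]).squeeze(-1)  # scalar per example
        return self.lm_head(h)                         # per-token logits

model = MultiHeadCatalystModel()
x = torch.randint(0, 32000, (2, 16))
print(model(x, "classification").shape)  # torch.Size([2, 4])
```

Routing each batch to the head matching its objective is what keeps qualitatively different losses (cross-entropy vs. mean-squared error) from interfering at the output layer, while gradients from all tasks still update the shared backbone.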
CatalystLLM: A domain-specific language model for catalysis

The authors develop CatalystLLM by applying their MFT strategy to fine-tune ChemLLM-7B on CatalystBench. Through systematic experiments, they demonstrate that CatalystLLM achieves state-of-the-art performance across all benchmark tasks, significantly outperforming both general-purpose and domain-specific language models.

10 retrieved papers
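
Training under the MFT strategy amounts to interleaving batches from all tasks and applying the loss that matches each head, so that every optimizer step accumulates gradients from heterogeneous objectives into the shared backbone. The sketch below illustrates one such step; the loss choices, model size, and step structure are assumptions, not the paper's exact recipe.

```python
import torch
import torch.nn as nn

class TinyMultiHead(nn.Module):
    """Toy stand-in for a shared-backbone model with per-task heads."""
    def __init__(self, hidden=8, n_classes=3):
        super().__init__()
        self.backbone = nn.Linear(hidden, hidden)
        self.heads = nn.ModuleDict({
            "classification": nn.Linear(hidden, n_classes),
            "regression": nn.Linear(hidden, 1),
        })

    def forward(self, x, task):
        h = torch.relu(self.backbone(x))
        out = self.heads[task](h)
        return out.squeeze(-1) if task == "regression" else out

# Each objective type gets the loss matching its head (an assumption).
losses = {"classification": nn.CrossEntropyLoss(),
          "regression": nn.MSELoss()}

def mft_step(model, optimizer, batches):
    """One full-task step over a mixed list of (task, inputs, targets)."""
    optimizer.zero_grad()
    total = 0.0
    for task, inputs, targets in batches:
        loss = losses[task](model(inputs, task), targets)
        loss.backward()   # gradients from every task accumulate in the backbone
        total += loss.item()
    optimizer.step()      # single update over all tasks' gradients
    return total

model = TinyMultiHead()
opt = torch.optim.SGD(model.parameters(), lr=0.1)
batches = [
    ("classification", torch.randn(4, 8), torch.randint(0, 3, (4,))),
    ("regression", torch.randn(4, 8), torch.randn(4)),
]
loss_value = mft_step(model, opt, batches)
```

Updating on all tasks in one step, rather than sequentially per task, is what distinguishes the Full-Task variants (FT, MFT) from their Single-Task counterparts (ST, MST) in the comparison above.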

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution
