OXtal: An All-Atom Diffusion Model for Organic Crystal Structure Prediction

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: Crystal structure prediction, Diffusion models for computational chemistry, AI for science
Abstract:

Accurately predicting experimentally realizable 3D molecular crystal structures from their 2D chemical graphs is a long-standing open challenge in computational chemistry called crystal structure prediction (CSP). Efficiently solving this problem has implications ranging from pharmaceuticals to organic semiconductors, as crystal packing directly governs the physical and chemical properties of organic solids. In this paper, we introduce OXtal, a large-scale 100M-parameter all-atom diffusion model that directly learns the conditional joint distribution over intramolecular conformations and periodic packing. To efficiently scale OXtal, we abandon explicit equivariant architectures imposing inductive bias arising from crystal symmetries in favor of data augmentation strategies. We further propose a novel crystallization-inspired lattice-free training scheme, Stoichiometric Stochastic Shell Sampling (S4), that efficiently captures long-range interactions while sidestepping explicit lattice parametrization, thus enabling more scalable architectural choices at all-atom resolution. Trained on 600K experimentally validated crystal structures (including rigid and flexible molecules, co-crystals, and solvates), OXtal achieves orders-of-magnitude improvements over prior ab initio ML CSP methods, while remaining orders of magnitude cheaper than traditional quantum-chemical approaches. Specifically, OXtal reproduces experimental structures with conformer RMSD1 < 0.5 Å and attains over 80% lattice-match success, demonstrating its ability to model both thermodynamic and kinetic regularities of molecular crystallization.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces OXtal, a 100M-parameter all-atom diffusion model that learns the joint distribution over molecular conformations and periodic packing for crystal structure prediction. It resides in the Generative Models leaf under Machine Learning Approaches, alongside two sibling papers (GAN GCN Prediction and Machine Learning Lattice). This leaf represents a relatively sparse research direction within the broader taxonomy of 50 papers, indicating that end-to-end generative approaches to CSP remain an emerging frontier compared to traditional search algorithms and energy evaluation methods.

The taxonomy reveals that OXtal's immediate neighbors include Machine Learning-Accelerated Sampling (hybrid methods integrating ML potentials with traditional search) and Machine Learning Potentials (neural networks for energy prediction). These adjacent leaves focus on accelerating or refining existing workflows rather than replacing them with direct generation. Further afield, Core Prediction Methodologies encompasses evolutionary algorithms and Monte Carlo methods that dominate the field's history. OXtal diverges by abandoning explicit equivariant architectures and symmetry-based inductive biases in favor of data augmentation, positioning it as a departure from both classical search and symmetry-constrained ML approaches.

Among 22 candidates examined across three contributions, no clearly refuting prior work was identified. For the core OXtal model, 10 candidates were examined with zero refutable matches; for the S4 training scheme, 2 candidates with zero refutable matches; and for the performance claims, 10 candidates with zero refutable matches. This suggests that within the limited search scope (top-K semantic matches plus citation expansion), the combination of large-scale diffusion modeling, lattice-free training, and all-atom resolution appears relatively unexplored. However, the small candidate pool (22 total) and the sparse Generative Models leaf (3 papers) mean this analysis captures only a narrow slice of the literature.

Based on the limited search scope, OXtal appears to occupy a novel position by combining diffusion-based generation with a crystallization-inspired training scheme at all-atom resolution. The absence of refuting candidates among 22 examined papers and the sparse population of the Generative Models leaf suggest the approach is relatively unexplored, though the analysis does not cover the full breadth of recent ML-CSP developments or adjacent fields like molecular generation and materials informatics.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 22
Refutable Papers: 0

Research Landscape Overview

Core task: organic crystal structure prediction. The field aims to computationally determine how organic molecules arrange themselves in three-dimensional crystal lattices, a challenge central to pharmaceuticals, materials design, and fundamental chemistry. The taxonomy reflects a mature discipline organized around several complementary themes. Core Prediction Methodologies encompass traditional search algorithms (evolutionary methods like Evolutionary CSP[13], Monte Carlo sampling as in Symmetry Constrained Monte[41]) alongside emerging Machine Learning Approaches that include generative models and data-driven frameworks (e.g., Machine Learning Lattice[6], GAN GCN Prediction[26]). Energy Modeling and Evaluation addresses the accuracy of force fields and quantum-mechanical methods (Modelling Intermolecular Forces[43]), while Landscape Analysis and Polymorphism investigates the multiplicity of stable forms (Polymorphism Prediction[15], Conformational Polymorphism[29]). Specialized Systems and Extensions tackle chemically diverse targets such as hydrates, salts, and hybrid perovskites (Organic Hydrates[35], Hybrid Perovskites[9]), and Experimental Integration and Validation bridges computation with powder diffraction and NMR data (Powder XRD Prediction[4], Ephedrine NMR[19]). Benchmark Studies, notably the long-running blind tests (Second Blind Test[20] through Sixth Blind Test[21]), provide community-wide performance assessments, while Applications and Property Prediction extend predictions to functional properties like carrier mobility (Heteroatoms Carrier Mobility[25]) and Reviews offer periodic syntheses of progress (State of Art[10], Current Approaches[47]). Recent years have seen intensified interest in machine learning strategies that promise to accelerate search and refine energy rankings, yet traditional physics-based methods remain indispensable for reliability and interpretability. 
A key tension lies between the need for exhaustive sampling of conformational and packing space versus the computational cost of high-level energy evaluations, a trade-off that multifidelity and active learning schemes (Multifidelity Statistical ML[37], Active Learning Potentials[32]) attempt to resolve. Within this landscape, OXtal[0] sits squarely in the generative-model branch of Machine Learning Approaches, emphasizing end-to-end generation of crystal structures rather than iterative search. This contrasts with hybrid strategies like Powder XRD Prediction[4], which integrates experimental constraints into the prediction loop, and with GAN GCN Prediction[26], which similarly explores deep generative architectures but may differ in network design or training objectives. By framing crystal prediction as a direct generation task, OXtal[0] exemplifies the shift toward learning-based paradigms that complement—and potentially streamline—decades of search-and-rank workflows.

Claimed Contributions

OXtal: large-scale all-atom diffusion model for molecular CSP

The authors introduce OXtal, a 100M parameter all-atom diffusion model that learns the conditional joint distribution over intramolecular conformations and periodic packing for molecular crystals, conditioned solely on 2D molecular graphs. The model abandons explicit equivariant architectures in favor of data augmentation strategies to efficiently scale.

10 retrieved papers
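
The idea of trading an equivariant architecture for data augmentation, as described in this contribution, can be illustrated with a minimal sketch: each training example is given a random rigid rotation and translation, so a non-equivariant network sees the same structure in many poses. The function names, the quaternion-based rotation sampling, and the `max_shift` parameter are assumptions made for illustration; the report does not describe OXtal's actual augmentation pipeline.

```python
import math
import random


def random_rotation(rng):
    """Sample a uniformly random 3x3 rotation matrix via a unit quaternion."""
    # Normalizing a 4D Gaussian sample yields a uniformly distributed unit quaternion.
    q = [rng.gauss(0.0, 1.0) for _ in range(4)]
    norm = math.sqrt(sum(c * c for c in q))
    w, x, y, z = (c / norm for c in q)
    return [
        [1 - 2 * (y * y + z * z), 2 * (x * y - z * w), 2 * (x * z + y * w)],
        [2 * (x * y + z * w), 1 - 2 * (x * x + z * z), 2 * (y * z - x * w)],
        [2 * (x * z - y * w), 2 * (y * z + x * w), 1 - 2 * (x * x + y * y)],
    ]


def augment(coords, rng, max_shift=1.0):
    """Apply one random rigid transform (rotation, then translation) to all atoms.

    `max_shift` is a hypothetical bound on the translation along each axis.
    """
    rot = random_rotation(rng)
    shift = [rng.uniform(-max_shift, max_shift) for _ in range(3)]
    return [
        tuple(sum(rot[i][j] * p[j] for j in range(3)) + shift[i] for i in range(3))
        for p in coords
    ]


def pairwise_distances(coords):
    """All interatomic distances, which a rigid transform must leave unchanged."""
    return [math.dist(a, b) for i, a in enumerate(coords) for b in coords[i + 1:]]
```

Because the transform is rigid, `pairwise_distances` returns the same values before and after `augment` (up to floating-point error), which is the property the augmentation relies on: the geometry is unchanged, only the pose differs.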

Stoichiometric Stochastic Shell Sampling (S4) training scheme

The authors propose S4, a novel lattice-free training scheme inspired by crystallization processes that efficiently captures long-range interactions by building concentric shells around molecules based on contact distances. This approach sidesteps explicit lattice parametrization while preserving molecular stoichiometry, enabling more scalable architectural choices at all-atom resolution.

2 retrieved papers
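
The phrase "building concentric shells around molecules based on contact distances" can be made concrete with a small sketch. The code below is not the authors' S4 algorithm: it covers only a plain breadth-first contact expansion around one central molecule and omits both the stochastic sampling and the stoichiometry-preserving step, neither of which is specified in this report. The names `in_contact` and `build_shells`, the single distance `cutoff`, and the representation of a molecule as a list of atom coordinates are all assumptions.

```python
import math


def in_contact(mol_a, mol_b, cutoff):
    """True if any atom pair across the two molecules is closer than `cutoff`."""
    return any(math.dist(a, b) < cutoff for a in mol_a for b in mol_b)


def build_shells(molecules, center, cutoff, n_shells):
    """Grow shells around molecules[center] by breadth-first contact expansion.

    Returns a list of shells: shell 0 is [center], and shell k holds the indices
    of molecules first reached after k contact steps. Stops early when no new
    molecule is in contact with the outermost shell.
    """
    shells = [[center]]
    seen = {center}
    for _ in range(n_shells):
        frontier = [
            idx
            for idx, mol in enumerate(molecules)
            if idx not in seen
            and any(in_contact(mol, molecules[j], cutoff) for j in shells[-1])
        ]
        if not frontier:
            break
        seen.update(frontier)
        shells.append(frontier)
    return shells
```

In a training setting, the molecules gathered in the first few shells would form a finite, lattice-free cluster that can be passed to the model in place of explicit unit-cell parameters; how OXtal chooses the shell count and balances species is left to the paper itself.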

Orders-of-magnitude improvements over prior ML CSP methods

The authors demonstrate that OXtal significantly outperforms existing machine learning-based ab initio CSP methods, recovering experimental structures with conformer RMSD1 < 0.5 Å and attaining over 80% lattice-match success. The model is also several orders of magnitude cheaper at inference time compared to traditional DFT-based quantum chemical approaches.

10 retrieved papers
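
For readers unfamiliar with the RMSD criterion quoted above, the following is a simplified sketch of a root-mean-square deviation check against a 0.5 Å threshold. It assumes the predicted and reference atoms are already matched one-to-one and rotationally superimposed, and it removes only the centroid offset; a real evaluation would also solve for the optimal rotation and handle atom matching. The function names and the exact definition used here are assumptions, not the paper's RMSD1 protocol.

```python
import math


def centered(coords):
    """Translate coordinates so that their centroid sits at the origin."""
    n = len(coords)
    centroid = [sum(p[i] for p in coords) / n for i in range(3)]
    return [tuple(p[i] - centroid[i] for i in range(3)) for p in coords]


def rmsd(pred, ref):
    """RMSD in the input units between matched atom lists, after centroid removal."""
    if not pred or len(pred) != len(ref):
        raise ValueError("pred and ref must be non-empty and of equal length")
    a, b = centered(pred), centered(ref)
    squared = sum(math.dist(p, q) ** 2 for p, q in zip(a, b))
    return math.sqrt(squared / len(a))


def within_threshold(pred, ref, threshold=0.5):
    """True if the structures agree to within `threshold` (0.5 Å in the report)."""
    return rmsd(pred, ref) < threshold
```

A pure translation of the reference gives an RMSD of zero under this definition, while displacing a single atom raises it in proportion to the displacement.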

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution
