RealBench: A Benchmark for Complex Physical Systems with Real-World Data

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: complex physical system, PDE, benchmark, real-world data, prediction
Abstract:

Predicting the evolution of complex physical systems remains a central problem in science and engineering. Despite rapid progress in scientific machine learning (ML) models, a critical bottleneck is the scarcity of real-world data, which is expensive to collect; as a result, most current models are trained and validated on simulated data. Beyond limiting the development and evaluation of scientific ML, this gap also hinders research into essential tasks such as sim-to-real transfer. We introduce RealPDEBench, the first benchmark for scientific ML that integrates real-world measurements with paired numerical simulations. RealPDEBench consists of five datasets, three tasks, eight metrics, and ten baselines. We first present five real-world measured datasets with paired simulated datasets across different complex physical systems. We then define three tasks that allow comparisons between real-world and simulated data and facilitate the development of methods to bridge the two. Moreover, we design eight evaluation metrics, spanning data-oriented and physics-oriented measures, and finally benchmark ten representative baselines, including state-of-the-art models, pretrained PDE foundation models, and a traditional method. Experiments reveal significant discrepancies between simulated and real-world data, while showing that pretraining on simulated data consistently improves both accuracy and convergence. With this work, we hope to provide insights grounded in real-world data, advancing scientific ML toward bridging the sim-to-real gap and real-world deployment.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces RealPDEBench, a benchmark integrating real-world measurements with paired numerical simulations for scientific machine learning. It resides in the 'Physics-Informed Neural Networks and Uncertainty Quantification' leaf, which contains only three papers total. This leaf sits within the broader 'Physics-Informed and Hybrid Modeling Approaches' branch, indicating a relatively sparse research direction compared to the more crowded robotic control branches (15 papers across three leaves). The focus on benchmark infrastructure for PDE prediction distinguishes it from the sibling papers, which emphasize calibration methods and graph-based physics engines.

The taxonomy reveals that neighboring leaves address hybrid transfer learning with physics priors (1 paper) and model-based reinforcement learning (3 papers), both emphasizing policy learning rather than benchmark construction. The broader field structure shows that most sim-to-real work concentrates on robotic control (15 papers) and digital twin monitoring (13 papers), with physics-informed modeling receiving less attention (5 papers total). RealPDEBench diverges from these directions by targeting scientific ML evaluation infrastructure rather than control policies or industrial monitoring, occupying a niche at the intersection of data-driven learning and physics-based simulation validation.

Among 30 candidates examined, the contribution-level analysis shows varied novelty profiles. The paired real-world and simulated dataset contribution (10 candidates examined, 0 refutable) appears most distinctive, as no prior work provides this specific benchmark infrastructure. The three task categories (10 candidates, 0 refutable) also show no direct overlap. However, the comprehensive evaluation framework (10 candidates, 1 refutable) encounters at least one candidate offering overlapping metrics or evaluation approaches. Given the limited search scope, these statistics suggest the benchmark infrastructure itself is relatively novel, while the evaluation methodology has more substantial prior work within the examined candidates.

Based on the top-30 semantic matches and taxonomy structure, the work addresses a sparse research direction with limited direct competition in its specific leaf. The benchmark contribution appears more novel than the evaluation framework, though the restricted search scope means additional relevant work may exist beyond the candidates examined. The taxonomy context suggests this represents a meaningful but incremental step in a less-explored corner of the broader sim-to-real transfer landscape.

Taxonomy

- Core-task Taxonomy Papers: 50
- Claimed Contributions: 3
- Contribution Candidate Papers Compared: 30
- Refutable Paper: 1

Research Landscape Overview

Core task: bridging the sim-to-real gap in complex physical system prediction. The field addresses the challenge of ensuring that models trained or validated in simulation can reliably predict real-world behavior across diverse physical systems. The taxonomy reveals five main branches:

- Sim-to-Real Transfer Methods for Robotic Control: domain randomization, policy adaptation, and reinforcement learning techniques that enable robots to generalize from synthetic to physical environments (e.g., Dynamics Randomization[5], Closing Sim-to-Real Loop[7]).
- Surveys and Frameworks: conceptual overviews and methodological guidance (Sim-to-Real Survey[3]).
- Autonomous Systems and Embodied AI: end-to-end learning and perception-action loops in agents operating in real environments (Embodied AI Survey[2]).
- Digital Twins and Virtual Monitoring Systems: persistent virtual replicas for industrial assets, infrastructure, and energy systems (Digital Twin Turbines[11], Multi Digital Twin[14]).
- Physics-Informed and Hybrid Modeling Approaches: integration of domain knowledge, neural networks, and uncertainty quantification to improve predictive fidelity and calibration (Calibrated Physics Informed[15], Graph Physics Engines[45]).

A particularly active line of work explores how to blend data-driven flexibility with physical constraints, trading off model expressiveness against interpretability and sample efficiency. Another contrasting theme is whether to adapt simulators to match reality through system identification and calibration, or to learn robust policies that tolerate discrepancies via randomization and domain adaptation.

RealBench[0] sits within the Physics-Informed and Hybrid Modeling branch, specifically addressing physics-informed neural networks and uncertainty quantification. It shares methodological kinship with Calibrated Physics Informed[15], which also emphasizes calibration and uncertainty-aware prediction, and with Graph Physics Engines[45], which leverages structured representations of physical interactions. Where some works prioritize pure learning or pure physics, RealBench[0] occupies a middle ground by systematically benchmarking how well hybrid approaches can close the sim-to-real gap when physical priors and neural flexibility are combined with rigorous uncertainty estimates.

Claimed Contributions

RealPDEBench benchmark with paired real-world and simulated data

The authors present RealPDEBench, the first scientific ML benchmark that systematically pairs real-world experimental measurements with numerical simulations across five complex physical systems. This benchmark includes more than 700 trajectories covering fluid dynamics and combustion scenarios, enabling systematic evaluation of models on real-world data and investigation of the sim-to-real gap.

10 retrieved papers
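The "paired" structure claimed above can be pictured with a minimal container type. This is only a sketch: the class name, field names, array shapes, and the gap helper are illustrative assumptions, not the benchmark's actual API.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class PairedTrajectory:
    """Hypothetical container pairing one real-world measurement with its
    matched numerical simulation (names/shapes are illustrative only)."""
    system: str        # e.g. a fluid-dynamics or combustion scenario
    real: np.ndarray   # measured fields, shape (T, C, H, W)
    sim: np.ndarray    # simulated fields on the same grid and time steps

    def sim_to_real_gap(self) -> float:
        """Per-trajectory RMSE between simulation and measurement."""
        return float(np.sqrt(np.mean((self.sim - self.real) ** 2)))
```

A dataset of such objects makes the sim-to-real discrepancy directly measurable trajectory by trajectory, which is what distinguishes a paired benchmark from separate simulated and real corpora.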
Three task categories for comparing real-world and simulated data

The authors define three training paradigms: training on simulated data, training on real-world data, and pretraining on simulated data followed by finetuning on real-world data. These tasks enable systematic comparison of the strengths and limitations of both data types and provide a foundation for developing methods that effectively combine them.

10 retrieved papers
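The three paradigms differ only in which data source feeds an otherwise generic training loop. The sketch below uses a linear one-step predictor (x_{t+1} ≈ W x_t) as a stand-in for any neural surrogate; the loop, data names, and hyperparameters are hypothetical, not the benchmark's reference implementation.

```python
import numpy as np

def train(weights, data, epochs=100, lr=1e-2):
    """Gradient descent on the one-step loss 0.5*||W x_t - x_{t+1}||^2,
    standing in for training any neural PDE surrogate."""
    for _ in range(epochs):
        for x_t, x_next in data:
            pred = weights @ x_t
            grad = np.outer(pred - x_next, x_t)  # dL/dW for the squared loss
            weights = weights - lr * grad
    return weights

# The three task settings, expressed with this one loop:
# 1) simulation only:      W = train(W0, sim_pairs)
# 2) real-world only:      W = train(W0, real_pairs)
# 3) sim-to-real transfer: W = train(W0, sim_pairs)            # pretrain
#                          W = train(W,  real_pairs, lr=1e-3)  # finetune, smaller lr
```

Task 3 reuses the simulation-trained weights as initialization for the real-data stage, which is the mechanism behind the paper's finding that simulated pretraining improves accuracy and convergence.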
Comprehensive evaluation framework with data-oriented and physics-oriented metrics

The authors introduce a comprehensive evaluation framework consisting of eight metrics that assess model performance from both data-oriented perspectives (such as RMSE and MAE) and physics-oriented perspectives (such as Fourier Space Error and Kinetic Energy Error). They benchmark ten representative baselines including state-of-the-art models and pretrained foundation models using this framework.

10 retrieved papers
Can Refute
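As a rough illustration of the two metric families, the sketch below implements two data-oriented metrics (RMSE, MAE) and two physics-oriented ones (a Fourier-space error and a kinetic-energy error). The exact formulas and field layouts used by RealPDEBench may differ; these definitions are assumptions.

```python
import numpy as np

def rmse(pred, true):
    """Data-oriented: root-mean-square error over all grid points."""
    return np.sqrt(np.mean((pred - true) ** 2))

def mae(pred, true):
    """Data-oriented: mean absolute error."""
    return np.mean(np.abs(pred - true))

def fourier_space_error(pred, true):
    """Physics-oriented: relative L2 error between 2D Fourier amplitude
    spectra, penalizing missing or spurious spatial scales.
    (Illustrative definition; the benchmark's may differ.)"""
    fp = np.abs(np.fft.fft2(pred))
    ft = np.abs(np.fft.fft2(true))
    return np.linalg.norm(fp - ft) / np.linalg.norm(ft)

def kinetic_energy_error(pred_uv, true_uv):
    """Physics-oriented: relative error of total kinetic energy
    0.5*(u^2 + v^2) summed over the domain.
    pred_uv/true_uv: velocity fields of shape (2, H, W)."""
    ke = lambda uv: 0.5 * np.sum(uv[0] ** 2 + uv[1] ** 2)
    return abs(ke(pred_uv) - ke(true_uv)) / ke(true_uv)
```

The point of the physics-oriented family is that a prediction can score well on pointwise RMSE while still distorting the energy spectrum or total kinetic energy, so the two families are complementary.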

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

RealPDEBench benchmark with paired real-world and simulated data

The authors present RealPDEBench, the first scientific ML benchmark that systematically pairs real-world experimental measurements with numerical simulations across five complex physical systems. This benchmark includes more than 700 trajectories covering fluid dynamics and combustion scenarios, enabling systematic evaluation of models on real-world data and investigation of the sim-to-real gap.

Contribution

Three task categories for comparing real-world and simulated data

The authors define three training paradigms: training on simulated data, training on real-world data, and pretraining on simulated data followed by finetuning on real-world data. These tasks enable systematic comparison of the strengths and limitations of both data types and provide a foundation for developing methods that effectively combine them.

Contribution

Comprehensive evaluation framework with data-oriented and physics-oriented metrics

The authors introduce a comprehensive evaluation framework consisting of eight metrics that assess model performance from both data-oriented perspectives (such as RMSE and MAE) and physics-oriented perspectives (such as Fourier Space Error and Kinetic Energy Error). They benchmark ten representative baselines including state-of-the-art models and pretrained foundation models using this framework.
