ATLAS: Alibaba Dataset and Benchmark for Learning-Augmented Scheduling

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: Scheduling with predictions, Dataset and benchmark, Machine learning, Learning-augmented scheduling, Non-clairvoyant scheduling
Abstract:

Learning-augmented scheduling uses ML predictions to improve decision-making under uncertainty. Many algorithms in this class have been proposed with stronger theoretical guarantees than classic methods. Translating these theoretical results into practice, however, requires an understanding of real workloads. Such an understanding is hard to develop because existing production traces either lack ground-truth processing times or are not publicly available, while synthetic benchmarks fail to capture real-world complexity. We fill this gap by introducing the Alibaba Trace for Learning-Augmented Scheduling (ATLAS), a research-ready dataset derived from the trace of Alibaba's Platform of Artificial Intelligence (PAI) cluster, a production system that processes hundreds of thousands of ML jobs per day. The ATLAS dataset has been cleaned and feature-engineered to represent the inputs and constraints of non-clairvoyant scheduling, including user tags, resource requests (CPU/GPU/memory), and job structures with ground-truth processing times. We develop a prediction benchmark that reports prediction error metrics and feature-importance analysis, and introduce a novel multi-stage ML model. We also provide a scheduling benchmark for minimizing total completion time, max-stretch, and makespan. ATLAS is a reproducible foundation for researchers to study learning-augmented scheduling on real workloads, available at https://anonymous.4open.science/r/non-clairvoyant-with-predictions-7BF8/.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces ATLAS, a production-derived dataset for learning-augmented scheduling, along with a prediction benchmark and a multi-stage ML model. It resides in the 'Benchmarks, Datasets, and Evaluation Frameworks' leaf of the taxonomy, which contains only two papers total. This leaf is notably sparse compared to more crowded branches such as 'Cloud and Data Center Scheduling' (six papers) or 'Task Execution Time and Resource Prediction' (four papers). The scarcity of benchmark resources in this field underscores the potential value of a well-curated dataset, as most prior work has focused on algorithm design or prediction models rather than standardized evaluation infrastructure.

The taxonomy reveals that ATLAS sits at the intersection of multiple research directions. Neighboring leaves include 'Task Execution Time and Resource Prediction' (four papers on ML models for job duration forecasting) and 'Cloud and Data Center Scheduling' (six papers on system implementations). The taxonomy's scope notes clarify that benchmark work should provide empirical infrastructure rather than novel algorithms or prediction techniques. ATLAS connects to these adjacent areas by offering a testbed for evaluating both prediction models and scheduling algorithms, bridging the gap between theoretical frameworks in 'Prediction Quality and Robustness' (four papers) and practical deployment in application domains.

Among twenty-four candidates examined via limited semantic search, none were found to clearly refute any of the three contributions. The ATLAS dataset contribution examined ten candidates with zero refutable matches; the LASched benchmark similarly examined ten candidates with no overlaps; the multi-stage ML model examined four candidates, also with no refutations. This suggests that within the scope of top-K semantic matches, the work occupies a relatively uncontested niche. However, the limited search scale means that more exhaustive exploration of adjacent fields—particularly production trace datasets in cloud computing or ML workload characterization—might reveal closer prior work not captured by this analysis.

Given the sparse benchmark leaf and the absence of refutations among examined candidates, the work appears to address a recognized gap in the field's empirical infrastructure. The limited search scope (twenty-four candidates) and the taxonomy's structure suggest that while the core contributions are distinct within learning-augmented scheduling, broader literature on workload traces or ML system benchmarks may contain related efforts not fully captured here. The analysis reflects what is visible through targeted semantic search rather than exhaustive coverage of all relevant domains.

Taxonomy

- Core-task taxonomy papers: 50
- Claimed contributions: 3
- Contribution candidate papers compared: 24
- Refutable papers: 0

Research Landscape Overview

Core task: learning-augmented scheduling with machine learning predictions. This emerging field integrates predictive models into scheduling algorithms to improve performance beyond worst-case guarantees.

The taxonomy organizes research into five main branches. Theoretical Foundations and Algorithm Design explores robustness-consistency trade-offs and competitive analysis when predictions may be imperfect, as seen in works like Calibrated Predictions[1] and Speed Predictions[2]. Prediction Models and Machine Learning Techniques examines how to generate and refine forecasts of job durations, resource demands, or system states, with contributions such as Learned Weights[16] and Feature Based Jobs[30]. Application Domains and System Implementation translates these ideas into real systems: cloud scheduling (GPU Cluster Scheduling[11], Kubernetes Optimization[45]), energy management (Energy Efficient Predictions[4], Renewable Microgrid[37]), and manufacturing (Smart Manufacturing[31], Master Production Scheduling[43]). Benchmarks, Datasets, and Evaluation Frameworks provides the empirical infrastructure to assess prediction quality and algorithm performance. Finally, Routing and Hybrid Optimization Problems extends learning-augmented ideas to vehicle routing and mixed combinatorial settings, such as Routing Under Uncertainty[46].

Several active lines of work highlight key trade-offs and open questions. One thread investigates how to design algorithms that remain competitive even when predictions are noisy or adversarial, balancing trust in forecasts with worst-case safeguards (Untrusted Predictions[21], Non Clairvoyant Partial[5]). Another focuses on practical deployment in cloud and edge environments, where real-time decisions must incorporate uncertain execution times and dynamic workloads (Real Time Predictions[8], ElasticBatch[20]).
ATLAS[0] sits squarely within the Benchmarks, Datasets, and Evaluation Frameworks branch, providing standardized testbeds and metrics to compare learning-augmented schedulers. Its emphasis on reproducible evaluation complements nearby efforts like Hybrid Prediction[41], which blends multiple forecasting sources, by offering a common ground for assessing how different prediction strategies translate into scheduling gains. This positions ATLAS[0] as an enabling resource that bridges theoretical algorithm design and empirical validation across diverse application domains.
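The robustness-consistency tension described above can be made concrete with a toy single-machine example (illustrative only; not code from the paper or from any cited work): running jobs in order of predicted size minimizes total completion time when predictions are accurate, while round-robin ignores predictions entirely and is robust to arbitrary prediction error.

```python
def total_completion_spjf(true_sizes, predicted_sizes):
    """Shortest-predicted-job-first: trust the predictions fully.

    Runs jobs to completion in increasing *predicted* size and returns
    the total completion time under the *true* sizes.
    """
    order = sorted(range(len(true_sizes)), key=lambda j: predicted_sizes[j])
    t, total = 0.0, 0.0
    for j in order:
        t += true_sizes[j]
        total += t
    return total


def total_completion_round_robin(true_sizes):
    """Idealized round-robin: all unfinished jobs run at equal rates.

    Prediction-free, hence robust: its total completion time does not
    depend on prediction quality at all.
    """
    remaining = sorted(true_sizes)
    t, total, prev = 0.0, 0.0, 0.0
    n = len(remaining)
    for i, p in enumerate(remaining):
        t += (p - prev) * (n - i)  # time until the next-smallest job finishes
        total += t
        prev = p
    return total
```

With true sizes [1, 2, 3], perfect predictions give SPJF a total completion time of 10, round-robin gives 14 regardless of predictions, and fully reversed predictions degrade SPJF to 14 — the blend of the two regimes is exactly what the robustness-consistency literature studies.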

Claimed Contributions

ATLAS dataset for learning-augmented scheduling

The authors introduce ATLAS, a dataset derived from Alibaba's production PAI cluster containing over 730,000 ML jobs with complete ground-truth processing times, submit-time features, and resource profiles. The dataset is specifically engineered for non-clairvoyant scheduling research, excluding post-execution metrics to prevent data leakage.

10 retrieved papers
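The leakage constraint claimed above (submit-time features only, post-execution metrics excluded) can be sketched as a simple split over each trace record. The column names below are hypothetical placeholders, not the actual ATLAS schema:

```python
# Hypothetical submit-time columns: known when the job is enqueued,
# so a non-clairvoyant scheduler may use them as features.
SUBMIT_TIME_FEATURES = {"user_tag", "cpu_request", "gpu_request",
                        "mem_request", "submit_time"}

# Hypothetical post-execution columns: only known after the job runs,
# so they may serve as ground truth but never as model input.
POST_EXECUTION = {"processing_time", "end_time", "actual_cpu_usage"}


def split_row(row):
    """Split a trace record into (features, target), dropping any
    post-execution field from the feature side to prevent leakage."""
    features = {k: v for k, v in row.items() if k in SUBMIT_TIME_FEATURES}
    target = row.get("processing_time")
    return features, target
```

Any pipeline built on the dataset would apply such a split before training, so that the job-size predictor never sees information unavailable at submission time.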
LASched prediction and scheduling benchmark

The authors develop LASched, a standardized benchmark with two components: a prediction task that evaluates ML models for job size prediction using multiple error metrics, and a scheduling task that evaluates learning-augmented algorithms across three objectives (total completion time, max-stretch, makespan) with reproducible evaluation protocols.

10 retrieved papers
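For reference, the three scheduling objectives named above can be computed from a schedule as follows — a minimal sketch assuming a single machine, non-preemptive execution, and all jobs released at time 0; LASched's actual evaluation protocol may differ:

```python
def evaluate_schedule(proc_times, order):
    """Evaluate a single-machine, non-preemptive schedule.

    proc_times: true processing time of each job (release times taken as 0).
    order: permutation of job indices giving the execution sequence.
    Returns (total completion time, max-stretch, makespan).
    """
    t = 0.0
    completion = [0.0] * len(proc_times)
    for j in order:
        t += proc_times[j]
        completion[j] = t
    total_completion = sum(completion)
    # A job's stretch is its flow time divided by its size;
    # with release times 0 this is simply C_j / p_j.
    max_stretch = max(c / p for c, p in zip(completion, proc_times))
    makespan = t  # time the last job finishes
    return total_completion, max_stretch, makespan
```

For example, with sizes [3, 1, 2] scheduled shortest-first, the completion times are 1, 3, 6, giving total completion time 10, max-stretch 2 (the size-3 job finishes at time 6), and makespan 6.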
Novel multi-stage ML prediction model

The authors propose a multi-stage prediction approach that combines classification-first routing with specialized regressors and validation-based calibration methods (including conformal quantile regression, isotonic calibration, and meta-stacking) to achieve superior coverage metrics for job size prediction.

4 retrieved papers
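The classification-first routing idea can be illustrated with a deliberately simplified sketch: a one-feature decision stump stands in for the paper's ML classifier, and a per-bin median stands in for its specialized regressors; the conformal/isotonic calibration and meta-stacking stages are omitted entirely. All names and thresholds here are hypothetical.

```python
import statistics


def fit_multistage(features, sizes, size_cutoff):
    """Two-stage predictor: route jobs to a short/long bin, then apply
    a bin-specialized 'regressor' (here just the bin's median size).

    features: one submit-time feature per job.
    sizes: ground-truth job sizes.
    size_cutoff: size separating 'short' from 'long' jobs.
    """
    labels = [s > size_cutoff for s in sizes]
    # Stage 1: learn the feature threshold that best separates the bins
    # (a one-dimensional decision stump).
    best = None
    for thr in sorted(set(features)):
        acc = sum((f > thr) == y for f, y in zip(features, labels)) / len(labels)
        if best is None or acc > best[1]:
            best = (thr, acc)
    thr = best[0]
    # Stage 2: specialize per bin (assumes both bins are non-empty).
    short_pred = statistics.median(s for f, s in zip(features, sizes) if f <= thr)
    long_pred = statistics.median(s for f, s in zip(features, sizes) if f > thr)
    return thr, short_pred, long_pred


def predict(model, feature):
    thr, short_pred, long_pred = model
    return long_pred if feature > thr else short_pred
```

The routing stage keeps the heavy tail of long jobs from dominating the loss for short jobs, which is the motivation for classification-first designs; the calibration stages the paper adds would then adjust each bin's predictions on held-out validation data.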

Core Task Comparisons

Comparisons with papers in the same taxonomy category.

Contribution Analysis

Detailed comparisons were run for each claimed contribution: the ATLAS dataset was compared against 10 retrieved candidates, the LASched benchmark against 10, and the multi-stage ML prediction model against 4. None of the retrieved candidates refuted the corresponding contribution; the contribution descriptions are as given under Claimed Contributions above.