CauKer: Classification Time Series Foundation Models Can Be Pretrained on Synthetic Data

ICLR 2026 Conference Submission · Anonymous Authors
Time Series Foundation Model · Time Series Classification
Abstract:

Time series foundation models (TSFMs) have recently gained significant attention due to their strong zero-shot capabilities and widespread real-world applications. Such models typically require computationally costly pretraining on large-scale, carefully curated collections of real-world sequences. To enable sample-efficient pretraining of TSFMs, we propose CauKer, a novel algorithm designed to generate diverse, causally coherent synthetic time series with realistic trends, seasonality, and nonlinear interactions. CauKer combines Gaussian Process (GP) kernel composition with Structural Causal Models (SCMs) to produce data for sample-efficient pretraining of state-of-the-art classification TSFMs that span different architectures and pretraining approaches. Additionally, our experiments reveal that CauKer-generated datasets exhibit clear scaling laws for both dataset size (10K to 10M samples) and model capacity (1M to 783M parameters), unlike real-world datasets, which display irregular scaling behavior.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholarly search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes CauKer, a synthetic data generation pipeline combining Gaussian Process kernel composition with Structural Causal Models to produce causally coherent time series for pretraining classification foundation models. It resides in the 'Causal and Structural Approaches' leaf under 'Synthetic Data Generation Methods', which contains only two papers total. This sparse population suggests the specific combination of causal modeling and kernel composition for TSFM pretraining is relatively underexplored compared to the eight papers in the neighboring 'Deep Generative Models' leaf, indicating a less crowded research direction.

The taxonomy reveals that most synthetic generation work clusters around deep generative methods (GANs, VAEs, diffusion models) or symbolic pairing for multimodal learning, while causal and structural approaches remain a minority. The paper's focus on explicit causal coherence and GP kernels distinguishes it from purely latent-based generators and from symbolic methods that pair data with textual annotations. Neighboring branches like 'Pretraining Strategies' (e.g., Chronos, Lag-Llama) demonstrate large-scale pretraining on diverse synthetic corpora but typically do not emphasize causal structure in data generation, highlighting a methodological divergence.

Among thirty candidates examined, none clearly refuted any of the three contributions: the CauKer pipeline (ten candidates, zero refutable), scaling law demonstration (ten candidates, zero refutable), and sample-efficient pretraining (ten candidates, zero refutable). The single sibling paper in the same taxonomy leaf may address related causal kernel techniques but did not appear as a refuting candidate. This limited search scope suggests that within the examined literature, the specific integration of causal models and GP kernels for TSFM pretraining appears novel, though the analysis does not cover the full breadth of causal time series or synthetic data research.

Based on top-thirty semantic matches and the sparse taxonomy leaf, the work appears to occupy a relatively unexplored niche combining causal structure with kernel-based synthesis for foundation model pretraining. The absence of refuting candidates among examined papers and the small sibling set suggest novelty within the analyzed scope, though a broader search across causal inference or time series generation communities might reveal additional related efforts not captured here.

Taxonomy

- Core-task Taxonomy Papers: 50
- Claimed Contributions: 3
- Contribution Candidate Papers Compared: 30
- Refutable Papers: 0

Research Landscape Overview

Core task: pretraining time series foundation models on synthetic data. The field organizes around four main branches that together address how synthetic data can enable scalable pretraining for time series. Synthetic Data Generation Methods explores diverse techniques, ranging from causal and structural approaches like CauKer[0] to GAN-based and diffusion-based generators, for creating realistic training signals. Pretraining Strategies and Architectures examines how models such as Chronos[7], Lag-Llama[11], and Timer[13] leverage these synthetic datasets alongside architectural choices (transformers, state-space models) to learn transferable representations. Application Domains and Task-Specific Models demonstrates the breadth of downstream uses, from finance (Finance Foundation Models[14]) and healthcare (EEG Classification[27]) to industrial IoT (IoT Synthetic[18]) and epidemic forecasting (Epidemic Forecasting[5]). Finally, Evaluation, Benchmarking, and Analysis investigates critical questions about data quality, zero-shot generalization (Zero-shot Anomaly Detection[2], Zero-shot Imputation[16]), and whether synthetic augmentation truly benefits model performance (Synthesize or Not[10], Synthetic vs Real[34]).

A particularly active line of work contrasts purely data-driven generation (GANs, diffusion models) with methods that encode domain structure or causal mechanisms, trading off flexibility for interpretability and sample efficiency. CauKer[0] sits squarely within the Causal and Structural Approaches cluster, emphasizing how embedding causal knowledge into synthetic data generation can yield more robust pretraining signals than black-box methods. The neighboring entry CauKer[1] also explores causal kernels but may simply be an alternate version of the same underlying work retrieved separately (see the note on duplicate versions in the disclaimer above); human reviewers should disambiguate the two.

Meanwhile, works such as Chronos[7] and Lag-Llama[11] demonstrate that large-scale pretraining on diverse (often purely synthetic) corpora can achieve strong zero-shot performance, raising open questions about when explicit causal modeling is necessary versus when sheer data scale suffices. The interplay between generation fidelity, structural inductive biases, and downstream task alignment remains a central theme across these branches.

Claimed Contributions

CauKer synthetic data generation pipeline for time series classification

The authors introduce CauKer, a synthetic data generation method that combines Gaussian Process kernel composition with Structural Causal Models to produce time series data suitable for pretraining classification foundation models. The approach generates sequences with both realistic temporal patterns and meaningful clustering structure for classification tasks.
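The generation recipe described above can be sketched in a few lines: compose positive semi-definite base kernels into richer covariances, draw root signals from the resulting GPs, and propagate them through a small causal graph. This is an illustrative reconstruction, not the authors' code; the specific kernels, the three-node graph, and the noise scale are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
t = np.linspace(0.0, 1.0, 200)

def rbf(t, ls):
    """Squared-exponential kernel: smooth trends."""
    d = t[:, None] - t[None, :]
    return np.exp(-0.5 * (d / ls) ** 2)

def periodic(t, period, ls):
    """Exp-sine-squared kernel: seasonality."""
    d = np.abs(t[:, None] - t[None, :])
    return np.exp(-2.0 * np.sin(np.pi * d / period) ** 2 / ls ** 2)

def sample_gp(K):
    """Draw one series from N(0, K); jitter keeps the Cholesky factor stable."""
    L = np.linalg.cholesky(K + 1e-6 * np.eye(len(K)))
    return L @ rng.standard_normal(len(K))

# Kernel composition: sums and products of PSD kernels stay PSD,
# so trend and seasonality components can be mixed freely.
K_trend = rbf(t, ls=0.5)
K_season = periodic(t, period=0.25, ls=1.0)

# Toy structural causal model: x1 and x2 are root causes sampled from GPs;
# x3 depends on both through a nonlinear mechanism plus observation noise.
x1 = sample_gp(K_trend + 0.3 * K_season)
x2 = sample_gp(rbf(t, ls=0.1))
x3 = np.tanh(x1) * x2 + 0.1 * rng.standard_normal(len(t))
```

Varying which kernels are composed and which causal graph is sampled is one plausible way such a pipeline could induce the distinct, clusterable generative mechanisms that classification pretraining needs.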

10 retrieved papers
Demonstration of clear scaling laws for synthetic pretraining data

The authors show that pretraining on CauKer-generated synthetic data reveals consistent scaling laws in both dataset size (10K to 10M samples) and model capacity (1M to 783M parameters), whereas real-world classification datasets exhibit irregular or absent scaling behavior. They present this as the first systematic investigation of scaling laws in zero-shot time series classification.
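A scaling "law" here means error falling as a power of dataset size. A minimal way to test for one is a linear fit in log-log space; the numbers below are made up purely to illustrate the procedure and are not the paper's results.

```python
import numpy as np

# Hypothetical error rates at increasing pretraining-set sizes (10K to 10M).
n = np.array([1e4, 1e5, 1e6, 1e7])
err = np.array([0.40, 0.31, 0.24, 0.19])

# A clean scaling law means err ~ a * n**(-b), i.e. a straight line in log-log space.
slope, log_a = np.polyfit(np.log(n), np.log(err), 1)
b = -slope  # positive exponent: error shrinks as a power of n
print(f"fitted exponent: {b:.3f}")

# R^2 of the log-log fit indicates how "clean" the scaling law is;
# irregular scaling (as reported for real-world data) would show up as low R^2.
pred = log_a + slope * np.log(n)
ss_res = np.sum((np.log(err) - pred) ** 2)
ss_tot = np.sum((np.log(err) - np.log(err).mean()) ** 2)
print(f"log-log R^2: {1 - ss_res / ss_tot:.3f}")
```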

10 retrieved papers
Sample-efficient pretraining achieving state-of-the-art classification performance

The authors demonstrate that time series foundation models pretrained exclusively on CauKer-generated synthetic data can match or nearly match the performance of models trained on much larger real-world datasets, achieving competitive state-of-the-art results while being significantly more sample-efficient.

10 retrieved papers

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution
