Aurora: Towards Universal Generative Multimodal Time Series Forecasting

ICLR 2026 Conference Submission · Anonymous Authors
Keywords: Time Series Forecasting, Multimodality
Abstract:

Cross-domain generalization is critical in time series forecasting because similar histories can lead to distinct future trends, driven by domain-specific characteristics. Recent work focuses either on unimodal time series foundation models or on end-to-end multimodal supervised models. Since domain-specific knowledge is often carried by other modalities such as text, the former cannot exploit it explicitly, which limits performance; the latter is tailored to end-to-end scenarios and does not support zero-shot inference in cross-domain settings. In this work, we introduce Aurora, a multimodal time series foundation model that supports multimodal inputs and zero-shot inference. Pretrained on a cross-domain multimodal time series corpus, Aurora adaptively extracts and focuses on key domain knowledge carried by the corresponding text or image modalities, giving it strong cross-domain generalization. Through tokenization, encoding, and distillation, Aurora extracts multimodal domain knowledge as guidance and injects it into the modeling of temporal representations via a Modality-Guided Multi-head Self-Attention. In the decoding phase, the multimodal representations are used to generate the conditions and prototypes of future tokens, yielding a novel Prototype-Guided Flow Matching for generative probabilistic forecasting. Comprehensive experiments on well-recognized benchmarks, including Time-MMD, TSFM-Bench, and ProbTS, demonstrate that Aurora achieves consistent state-of-the-art performance in both unimodal and multimodal scenarios.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

Aurora proposes a multimodal time series foundation model that integrates text and image modalities for zero-shot cross-domain forecasting. The paper resides in the 'Generative and Probabilistic Multimodal Models' leaf, which contains only two papers including Aurora itself. This sparse population suggests the intersection of generative modeling and multimodal time series forecasting remains relatively underexplored. The sibling paper Multi-Modal Forecaster also employs generative fusion but appears to prioritize within-domain accuracy over cross-domain robustness, indicating Aurora targets a distinct design goal within this small research cluster.

The taxonomy reveals Aurora sits at the intersection of multiple research directions. Its parent branch 'Multimodal Fusion Architectures for Time Series' neighbors 'Attention-Based Cross-Modal Fusion' (three papers) and 'Perturbation-Aware and Robust Fusion' (two papers), suggesting alternative fusion strategies exist. Meanwhile, the 'Cross-Domain Transfer and Adaptation Methods' branch contains domain adaptive networks and few-shot adaptation techniques that address generalization without necessarily incorporating multimodal inputs. Aurora's positioning suggests it bridges generative multimodal fusion with cross-domain transfer objectives, a combination less populated than either direction individually.

Among the twenty candidates examined across the three contributions, only one refutable pair emerged, and it concerns the core Aurora model contribution (ten candidates examined, one refutable). The modality-guided attention mechanism showed no refutations across its ten candidates, suggesting this architectural component may offer more novelty within the limited search scope. The prototype-guided flow matching contribution retrieved no candidate papers and was therefore not evaluated against prior work in this analysis. These statistics indicate that, among the top twenty semantically similar papers examined, most do not directly overlap with Aurora's specific combination of generative modeling, multimodal fusion, and zero-shot cross-domain forecasting.

Based on the limited search scope of twenty candidates, Aurora appears to occupy a relatively sparse intersection of research directions. The analysis does not cover exhaustive literature review or domain-specific forecasting applications outside the examined set. The single refutable candidate for the core model suggests some prior work exists in related areas, though the specifics of overlap remain unclear from the provided statistics alone.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 20
Refutable Papers: 1

Research Landscape Overview

Core task: multimodal time series forecasting with cross-domain generalization. This field addresses the challenge of predicting future temporal patterns by integrating diverse data modalities (e.g., numerical series, text, images) while ensuring models generalize across different application domains. The taxonomy reveals several complementary research directions. LLM-Based Multimodal Forecasting Frameworks explore how large language models can unify heterogeneous inputs for prediction tasks, as seen in works like LangTime[2] and MMGPT4LF[6]. Multimodal Fusion Architectures for Time Series develop specialized neural designs that combine temporal and cross-modal dependencies, including generative and probabilistic approaches such as Multi-Modal Forecaster[30]. Cross-Domain Transfer and Adaptation Methods focus on enabling models trained in one domain to perform well in another, exemplified by Cross Domain Transformer[4] and Domain Transfer Spatiotemporal[13]. Meanwhile, Multimodal Datasets and Benchmarking efforts like Time-MMD[18] provide standardized evaluation resources, and Domain-Specific Multimodal Applications demonstrate practical deployments in finance, healthcare, and other sectors.

Recent work highlights tensions between model complexity and generalization robustness. Some studies pursue deep fusion mechanisms that tightly integrate modalities, while others investigate whether simpler alignment strategies suffice, as questioned by Does Multimodality Lead[14] and When Multimodality Better[34]. Aurora[0] sits within the Generative and Probabilistic Multimodal Models cluster, emphasizing probabilistic modeling to handle uncertainty in cross-domain scenarios. Compared to Multi-Modal Forecaster[30], which also adopts generative fusion, Aurora[0] appears to place stronger emphasis on cross-domain robustness rather than purely within-domain accuracy. This contrasts with deterministic fusion approaches like Text Reinforcement Multimodal[3], which leverage reinforcement signals but may sacrifice probabilistic calibration. Open questions remain about how to balance expressive multimodal representations with the need for domain-invariant features that transfer reliably across diverse forecasting contexts.

Claimed Contributions

Aurora: A Multimodal Time Series Foundation Model

The authors introduce Aurora, a foundation model pretrained on cross-domain multimodal time series data that accepts text and image inputs alongside time series. It supports zero-shot inference and generative probabilistic forecasting by fusing multimodal domain knowledge to enhance cross-domain generalization.

10 retrieved papers · Can Refute (1 paper)
Modality-Guided Multi-head Self-Attention Mechanism

The authors design a cross-modality encoder that distills key information from text and image tokens, then uses a Modality-Guided Multi-head Self-Attention mechanism to inject external domain knowledge into temporal feature modeling, thereby enhancing temporal representations.

10 retrieved papers · No Refutation
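The description above leaves the exact fusion mechanism unspecified. As one plausible reading, here is a minimal NumPy sketch in which a distilled modality summary vector additively biases the attention queries over temporal tokens; the shapes, the random stand-in weights, and the query-bias design are all assumptions for illustration, not the paper's actual implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def modality_guided_mhsa(x, guide, n_heads=4, seed=0):
    """Multi-head self-attention over temporal tokens x (T, d), with a
    modality guidance vector guide (d,) injected into the queries.
    Weight matrices are random stand-ins for learned parameters."""
    T, d = x.shape
    dh = d // n_heads
    rng = np.random.default_rng(seed)
    Wq, Wk, Wv, Wg = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(4))
    # Inject distilled multimodal knowledge: every query is biased by the
    # projected guidance vector before attending over the temporal tokens.
    q = (x + guide @ Wg) @ Wq
    k = x @ Wk
    v = x @ Wv
    out = np.empty_like(x)
    for h in range(n_heads):               # per-head scaled dot-product attention
        s = slice(h * dh, (h + 1) * dh)
        scores = q[:, s] @ k[:, s].T / np.sqrt(dh)
        out[:, s] = softmax(scores) @ v[:, s]
    return out

x = np.random.default_rng(1).standard_normal((16, 32))  # 16 temporal tokens, d=32
guide = np.random.default_rng(2).standard_normal(32)    # distilled text/image summary
y = modality_guided_mhsa(x, guide)
print(y.shape)  # (16, 32)
```

Other injection points (biasing the keys, or a separate cross-attention stage) would fit the same description equally well; the sketch only illustrates the idea of guidance injected into temporal attention.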
Prototype-Guided Flow Matching for Generative Forecasting

The authors propose a novel flow-matching approach that generates multimodal conditions via a Condition Decoder and retrieves future prototypes (containing periodicity and trend) from a Prototype Bank as starting points, replacing standard Gaussian initialization to simplify and enhance the generative probabilistic forecasting process.

0 retrieved papers
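The recipe above can be illustrated with the standard linear-interpolation flow-matching formulation, where a training pair uses the path x_t = (1 − t)·x0 + t·x1 and velocity target x1 − x0, except that the source x0 is a retrieved prototype rather than Gaussian noise. The prototype bank, the cosine-similarity retrieval, and all names below are hypothetical stand-ins for the paper's Condition Decoder and Prototype Bank:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical prototype bank: K candidate future shapes of horizon H,
# standing in for stored periodicity/trend patterns.
H, K = 24, 8
bank = np.stack([np.sin(2 * np.pi * np.arange(H) / p) for p in range(4, 4 + K)])

def retrieve_prototype(condition):
    """Nearest prototype by cosine similarity to a condition vector (H,)."""
    sims = bank @ condition / (np.linalg.norm(bank, axis=1)
                               * np.linalg.norm(condition) + 1e-8)
    return bank[np.argmax(sims)]

def flow_matching_pair(x1, condition, t):
    """One flow-matching training pair with a prototype start point.

    Because x0 is already a plausible future shape, the learned flow only
    has to transport a nearby prototype onto the true future, instead of
    transporting Gaussian noise all the way."""
    x0 = retrieve_prototype(condition)   # prototype replaces N(0, I) init
    xt = (1.0 - t) * x0 + t * x1         # interpolated sample on the path
    v_target = x1 - x0                   # regression target for the velocity net
    return xt, v_target

# Toy target future: a noisy period-6 wave; the condition here is the target
# itself purely for demonstration.
x1 = np.sin(2 * np.pi * np.arange(H) / 6) + 0.1 * rng.standard_normal(H)
xt, v = flow_matching_pair(x1, condition=x1, t=0.5)
print(xt.shape, v.shape)  # (24,) (24,)
```

At inference, sampling would start from the retrieved prototype and integrate the learned velocity field toward the forecast, which is what "replacing standard Gaussian initialization" amounts to in this formulation.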

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

