Aurora: Towards Universal Generative Multimodal Time Series Forecasting
Overview
Overall Novelty Assessment
Aurora proposes a multimodal time series foundation model that integrates text and image modalities for zero-shot cross-domain forecasting. The paper resides in the 'Generative and Probabilistic Multimodal Models' leaf, which contains only two papers including Aurora itself. This sparse population suggests the intersection of generative modeling and multimodal time series forecasting remains relatively underexplored. The sibling paper Multi-Modal Forecaster also employs generative fusion but appears to prioritize within-domain accuracy over cross-domain robustness, indicating Aurora targets a distinct design goal within this small research cluster.
The taxonomy reveals Aurora sits at the intersection of multiple research directions. Its parent branch 'Multimodal Fusion Architectures for Time Series' neighbors 'Attention-Based Cross-Modal Fusion' (three papers) and 'Perturbation-Aware and Robust Fusion' (two papers), suggesting alternative fusion strategies exist. Meanwhile, the 'Cross-Domain Transfer and Adaptation Methods' branch contains domain adaptive networks and few-shot adaptation techniques that address generalization without necessarily incorporating multimodal inputs. Aurora's positioning suggests it bridges generative multimodal fusion with cross-domain transfer objectives, a combination less populated than either direction individually.
Across the two contributions evaluated, twenty candidate papers were examined. For the core Aurora model contribution, one of ten candidates was refutable; for the modality-guided attention mechanism, none of the ten candidates was refutable, suggesting this architectural component may carry more novelty within the limited search scope. The prototype-guided flow matching contribution was not evaluated against prior work in this analysis. These statistics indicate that, among the top twenty semantically similar papers examined, most do not directly overlap with Aurora's specific combination of generative modeling, multimodal fusion, and zero-shot cross-domain forecasting.
Given the limited search scope of twenty candidates, Aurora appears to occupy a relatively sparse intersection of research directions. The analysis is not an exhaustive literature review and does not cover domain-specific forecasting applications outside the examined set. The single refutable candidate for the core model suggests some related prior work exists, though the specifics of the overlap cannot be determined from the provided statistics alone.
Taxonomy
Research Landscape Overview
Claimed Contributions
The authors introduce Aurora, a foundation model pretrained on cross-domain multimodal time series data that accepts text and image inputs alongside time series. It supports zero-shot inference and generative probabilistic forecasting by fusing multimodal domain knowledge to enhance cross-domain generalization.
The authors design a cross-modality encoder that distills key information from text and image tokens, then uses a Modality-Guided Multi-head Self-Attention mechanism to inject external domain knowledge into temporal feature modeling, thereby enhancing temporal representations.
The authors propose a novel flow-matching approach that generates multimodal conditions via a Condition Decoder and retrieves future prototypes, which encode periodicity and trend, from a Prototype Bank to serve as starting points. Replacing the standard Gaussian initialization in this way simplifies and strengthens the generative probabilistic forecasting process.
Core Task Comparisons
Comparisons with papers in the same taxonomy category
[30] Multi-modal forecaster: Jointly predicting time series and textual data
Contribution Analysis
Detailed comparisons for each claimed contribution
Aurora: A Multimodal Time Series Foundation Model
The authors introduce Aurora, a foundation model pretrained on cross-domain multimodal time series data that accepts text and image inputs alongside time series. It supports zero-shot inference and generative probabilistic forecasting by fusing multimodal domain knowledge to enhance cross-domain generalization.
[67] Time-vlm: Exploring multimodal vision-language models for augmented time series forecasting
[7] UniTime: A Language-Empowered Unified Model for Cross-Domain Time Series Forecasting
[14] Does Multimodality Lead to Better Time Series Forecasting?
[34] When Does Multimodality Lead to Better Time Series Forecasting?
[61] Foundation models for time series analysis: A tutorial and survey
[62] Multimodal Conditioned Diffusive Time Series Forecasting
[63] GPT4MTS: Prompt-based Large Language Model for Multimodal Time-series Forecasting
[64] Low-Rank Adaptation of Time Series Foundational Models for Out-of-Domain Modality Forecasting
[65] On the Opportunities and Challenges of Foundation Models for Geospatial Artificial Intelligence
[66] UniCA: Adapting Time Series Foundation Model to General Covariate-Aware Forecasting
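As context for this claim, a generative probabilistic forecaster is judged by the distribution it places over future trajectories rather than a single point estimate. The following minimal NumPy sketch is not Aurora's model: the drift-plus-noise sampler is an illustrative stand-in, and `probabilistic_forecast` is a hypothetical name. It only illustrates the output format such a model produces, namely sampled paths summarized as quantile bands.

```python
import numpy as np

def probabilistic_forecast(history, horizon, n_samples=200, rng=None):
    """Toy stand-in for a generative forecaster: draws sample paths
    around a naive drift extrapolation and reports quantile bands."""
    rng = np.random.default_rng(0) if rng is None else rng
    # naive drift and step-noise estimates from the observed history
    drift = (history[-1] - history[0]) / max(len(history) - 1, 1)
    sigma = np.std(np.diff(history)) + 1e-9
    # sample many future step sequences and accumulate them into paths
    steps = rng.normal(drift, sigma, size=(n_samples, horizon))
    paths = history[-1] + np.cumsum(steps, axis=1)
    # summarize the sampled distribution as per-step quantiles
    return {q: np.quantile(paths, q, axis=0) for q in (0.1, 0.5, 0.9)}
```

A model like Aurora would replace the naive sampler with a learned generative process conditioned on multimodal inputs, but the consumer-facing interface, a set of quantile trajectories, is the same.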
Modality-Guided Multi-head Self-Attention Mechanism
The authors design a cross-modality encoder that distills key information from text and image tokens, then uses a Modality-Guided Multi-head Self-Attention mechanism to inject external domain knowledge into temporal feature modeling, thereby enhancing temporal representations.
[51] Llava-st: A multimodal large language model for fine-grained spatial-temporal understanding
[52] Attention-based multimodal fusion for video description
[53] Causal-Aware Multimodal Transformer for Supply Chain Demand Forecasting: Integrating Text, Time Series, and Satellite Imagery
[54] GAME: Learning Multimodal Interactions via Graph Structures for Personality Trait Estimation
[55] HunyuanCustom: A Multimodal-Driven Architecture for Customized Video Generation
[56] Personalized Multimodal Emotion Recognition: Integrating Temporal Dynamics and Individual Traits for Enhanced Performance
[57] Attending to customer attention: A novel deep learning method for leveraging multimodal online reviews to enhance sales prediction
[58] Leveraging Foundation Models for Multimodal Graph-Based Action Recognition
[59] Multimodal Deep Learning for Video Classification
[60] Vision-text cross-modal fusion for accurate video captioning
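As described, the claimed mechanism resembles cross-attention in which temporal tokens query distilled text and image tokens, injecting external domain knowledge into the temporal representations. The NumPy sketch below follows that reading; it is an assumption, not the paper's code. The random projection matrices stand in for learned weights, and `modality_guided_attention` is a hypothetical name.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def modality_guided_attention(ts_tokens, modality_tokens, n_heads=4, rng=None):
    """Sketch: time-series tokens (L, d) attend over distilled
    text/image tokens (M, d); the result is added residually."""
    rng = np.random.default_rng(0) if rng is None else rng
    L, d = ts_tokens.shape
    M, _ = modality_tokens.shape
    dh = d // n_heads
    # random projections stand in for learned Q/K/V/output weights
    Wq, Wk, Wv, Wo = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(4))
    # queries from temporal tokens; keys/values from modality tokens
    Q = (ts_tokens @ Wq).reshape(L, n_heads, dh).transpose(1, 0, 2)
    K = (modality_tokens @ Wk).reshape(M, n_heads, dh).transpose(1, 0, 2)
    V = (modality_tokens @ Wv).reshape(M, n_heads, dh).transpose(1, 0, 2)
    attn = softmax(Q @ K.transpose(0, 2, 1) / np.sqrt(dh), axis=-1)  # (h, L, M)
    out = (attn @ V).transpose(1, 0, 2).reshape(L, d) @ Wo
    return ts_tokens + out  # residual: enhanced temporal representation
```

The residual connection matters here: the modality stream only refines the temporal features, so the model can fall back to pure time-series representations when the external tokens are uninformative.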
Prototype-Guided Flow Matching for Generative Forecasting
The authors propose a novel flow-matching approach that generates multimodal conditions via a Condition Decoder and retrieves future prototypes, which encode periodicity and trend, from a Prototype Bank to serve as starting points. Replacing the standard Gaussian initialization in this way simplifies and strengthens the generative probabilistic forecasting process.
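Flow matching generates samples by integrating a learned velocity field from a starting distribution toward the data distribution; the claim replaces the usual Gaussian starting point with a retrieved prototype, so the flow has less distance to travel. The toy sketch below illustrates only that idea: the cosine-similarity retrieval and the linear velocity field are illustrative assumptions, not the paper's method.

```python
import numpy as np

def retrieve_prototype(bank, query):
    """Return the bank row most cosine-similar to the query --
    a stand-in for the paper's Prototype Bank retrieval."""
    sims = bank @ query / (np.linalg.norm(bank, axis=1) * np.linalg.norm(query) + 1e-9)
    return bank[int(np.argmax(sims))]

def euler_flow(x0, velocity_fn, cond, n_steps=20):
    """Integrate dx/dt = v(x, t, cond) from t=0 to t=1 with Euler steps.
    In flow matching, velocity_fn would be a trained network."""
    x, dt = x0.copy(), 1.0 / n_steps
    for i in range(n_steps):
        x = x + dt * velocity_fn(x, i * dt, cond)
    return x

# toy demo: a linear field pulling the state toward the condition
rng = np.random.default_rng(0)
bank = rng.standard_normal((4, 6))          # stand-in prototype bank
cond = rng.standard_normal(6)               # stand-in multimodal condition
x0 = retrieve_prototype(bank, cond)         # prototype start, not Gaussian noise
xT = euler_flow(x0, lambda x, t, c: c - x, cond)
```

A prototype that already carries the target's periodicity and trend sits closer to the data manifold than Gaussian noise, which is the intuition behind the claimed simplification of the generative process.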