Abstract:

Accurate analysis of medical time series (MedTS) data, such as Electroencephalography (EEG) and Electrocardiography (ECG), plays a pivotal role in healthcare applications, including the diagnosis of brain and heart diseases. MedTS data typically exhibits two critical patterns: temporal dependencies within individual channels and channel dependencies across multiple channels. While recent advances in deep learning have leveraged Transformer-based models to effectively capture temporal dependencies, they often struggle to model channel dependencies. This limitation stems from a structural mismatch: MedTS signals are inherently centralized, whereas the Transformer's attention is decentralized, making it less effective at capturing global synchronization and unified waveform patterns. To bridge this gap, we propose CoTAR (Core Token Aggregation-Redistribution), a centralized MLP-based module tailored to replace the decentralized attention. Instead of allowing all tokens to interact directly, as in attention, CoTAR introduces a global core token that acts as a proxy to facilitate inter-token interaction, thereby enforcing a centralized aggregation and redistribution strategy. This design not only better aligns with the centralized nature of MedTS signals but also reduces computational complexity from quadratic to linear. Experiments on five benchmarks validate the superiority of our method in both effectiveness and efficiency, achieving up to a 12.13% improvement on the APAVA dataset, with merely 33% memory usage and 20% inference time compared to the previous state-of-the-art. Code and all training scripts are available at this link.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes CoTAR, a centralized MLP-based module replacing decentralized attention in transformers for medical time series analysis. It sits within the Transformer-Based Joint Modeling leaf, which contains only three papers including this work. This is a relatively sparse research direction within the broader taxonomy of 50 papers across 36 topics, suggesting the specific approach of replacing attention with centralized token aggregation for joint temporal-channel modeling is not yet heavily explored in the medical time series literature.

The taxonomy reveals that joint spatiotemporal modeling represents one of several major branches, alongside temporal-only architectures (RNNs, attention-based temporal models) and channel-only methods (graph-based, MLP-based channel mixing). The paper's sibling works—Medformer and Dispformer—employ standard transformer blocks for joint modeling, while neighboring leaves include hybrid convolutional-recurrent architectures and multi-scale approaches. The scope notes indicate that transformers focusing solely on temporal attention belong elsewhere, clarifying that this leaf specifically addresses integrated temporal-channel mechanisms. The paper diverges from graph-based channel modeling and attention-based channel mixing by introducing a centralized proxy token rather than direct pairwise interactions.

Among the 30 candidates examined (10 per contribution), the CoTAR module has one refutable candidate, and the TeCh framework with Adaptive Dual Tokenization likewise has one. The identification of the structural mismatch between attention and medical time series appears more novel, with zero refutable candidates among its 10. This suggests that while the specific architectural components may have some precedent within the limited search scope, the conceptual framing of centralized versus decentralized modeling for medical signals is less explored. Overall, the analysis indicates moderate prior-work overlap for the core technical contributions but stronger novelty in the problem formulation.

Based on the top-30 semantic matches examined, the work appears to occupy a relatively underexplored niche within transformer-based joint modeling, though the limited search scope means potentially relevant work outside this candidate set remains unexamined. The sparse population of the taxonomy leaf and the conceptual novelty of the centralization argument suggest meaningful differentiation from existing approaches, while the refutable candidates for specific modules indicate some architectural overlap within the examined literature.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 2

Research Landscape Overview

Core task: modeling temporal and channel dependencies in medical time series. The field addresses the dual challenge of capturing how clinical variables evolve over time and how they interact with one another across channels. The taxonomy reveals several major branches:

- Temporal Dependency Modeling Architectures focus on sequential patterns using RNNs, LSTMs, and attention mechanisms;
- Channel Dependency and Inter-Variable Modeling emphasizes cross-variable relationships through graph-based or correlation-driven methods;
- Joint Spatiotemporal Modeling integrates both dimensions simultaneously, often via transformers or hybrid architectures;
- Data Imputation and Reconstruction tackles missing data;
- Clinical Prediction and Classification applies these models to diagnostic tasks;
- Representation Learning and Self-Supervision explores unsupervised pretraining;
- Specialized Modeling Techniques covers domain-specific innovations.

Representative works such as Medformer[1] and Dispformer[39] illustrate transformer-based joint modeling, while approaches like Spatiotemporal Graph Medical[2] and Channel Independence Mamba[7] highlight contrasting strategies for handling variable interactions. A central tension in the field lies between methods that explicitly model channel dependencies and those that treat channels independently to reduce complexity. Transformer-based joint modeling has emerged as a particularly active direction, balancing expressiveness with computational feasibility.

Decentralized Attention Medical[0] sits within this branch alongside Medformer[1] and Dispformer[39], emphasizing efficient mechanisms that capture both temporal evolution and inter-channel relationships without prohibitive computational cost. Compared to Medformer[1], which employs standard transformer blocks, Decentralized Attention Medical[0] explores an alternative design that replaces direct pairwise attention with a centralized proxy token. Meanwhile, works like SimTA[5] and Clinical ICD Coding[3] demonstrate how joint modeling supports diverse downstream tasks, from representation learning to multi-label classification, underscoring ongoing questions about how best to balance model capacity, interpretability, and clinical utility in irregular, high-dimensional medical time series.

Claimed Contributions

Core Token Aggregation-Redistribution (CoTAR) module

CoTAR is a centralized MLP-based module that replaces the standard attention mechanism in Transformers. Instead of direct pairwise token interactions, it introduces a global core token that aggregates information from all tokens and redistributes it back, reducing computational complexity from quadratic to linear while better aligning with the centralized nature of medical time series signals.
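The aggregation-redistribution idea can be sketched in plain Python. This is a minimal illustration under assumptions: the function names, the mean pooling, and the single linear layer standing in for each MLP are hypothetical, not the paper's actual implementation.

```python
def linear(x, W, b):
    """y = W @ x + b for a plain-list vector x."""
    return [sum(w_row[j] * x[j] for j in range(len(x))) + b_i
            for w_row, b_i in zip(W, b)]

def cotar_step(tokens, W_agg, b_agg, W_red, b_red):
    """One centralized aggregation-redistribution pass (sketch).

    tokens: list of N feature vectors, each of dimension d.
    Aggregation: pool all tokens into a single core token (O(N) work),
    then transform the pooled vector. Redistribution: broadcast the
    transformed core back to every token as an additive update (O(N)).
    No token-to-token pairwise interaction occurs, so the total cost is
    linear in N rather than the quadratic cost of full attention.
    """
    d = len(tokens[0])
    pooled = [sum(t[j] for t in tokens) / len(tokens) for j in range(d)]
    core = linear(pooled, W_agg, b_agg)    # aggregation into the core token
    update = linear(core, W_red, b_red)    # redistribution message
    return [[t[j] + update[j] for j in range(d)] for t in tokens]
```

With identity weights and zero biases, each token is simply shifted by the token mean, making the centralized information flow easy to inspect.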

10 retrieved papers compared; status: Can Refute.
TeCh framework with Adaptive Dual Tokenization

TeCh is a unified framework built on CoTAR that can adaptively model temporal dependencies, channel dependencies, or both by adjusting the tokenization strategy (Temporal, Channel, or Dual). This flexibility allows the framework to better match the unique characteristics of different medical time series datasets.
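The tokenization switch described above can be sketched as follows. The function name `tokenize`, the `patch` parameter, and the flattening scheme are assumptions for illustration; the report does not specify the framework's actual tokenization details.

```python
def tokenize(series, mode="dual", patch=4):
    """Sketch of an adaptive dual tokenization strategy.

    series: C channels, each a list of T samples.
    'temporal': one token per time patch, stacking values from all
                channels, so the token sequence follows time;
    'channel' : one token per channel, spanning its full series, so the
                token sequence follows channels;
    'dual'    : both token sets concatenated.
    In a real model, each token set would then be linearly projected to
    a shared embedding dimension before entering the backbone.
    """
    C, T = len(series), len(series[0])
    temporal = [
        [series[c][t] for c in range(C) for t in range(p0, min(p0 + patch, T))]
        for p0 in range(0, T, patch)
    ]
    channel = [list(ch) for ch in series]
    if mode == "temporal":
        return temporal
    if mode == "channel":
        return channel
    return temporal + channel
```

For a 2-channel series of length 8 with `patch=4`, the temporal mode yields 2 tokens, the channel mode 2 tokens, and the dual mode their concatenation of 4.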

10 retrieved papers compared; status: Can Refute.
Identification of structural mismatch between attention and medical time series

The authors identify and formalize a fundamental mismatch: medical time series signals like EEG and ECG originate from centralized biological sources (brain, heart), while Transformer attention operates as a decentralized graph where all tokens interact equally. This mismatch causes attention to fail at capturing the global synchronization and unified patterns essential for modeling channel dependencies in medical signals.

10 retrieved papers compared; no refutable candidates found.

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Core Token Aggregation-Redistribution (CoTAR) module


Contribution

TeCh framework with Adaptive Dual Tokenization


Contribution

Identification of structural mismatch between attention and medical time series
