Are EEG Foundation Models Worth It? Comparative Evaluation with Traditional Decoders in Diverse BCI Tasks

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: Foundation Model, Brain–Computer Interface, EEG, Benchmark
Abstract:

Foundation models have recently emerged as a promising approach for learning generalizable EEG representations for brain–computer interfaces (BCIs). Yet, their true advantages over traditional methods—particularly classical non-neural approaches—remain unclear. In this work, we present a comprehensive benchmark of state-of-the-art EEG foundation models, evaluated across diverse datasets, decoding tasks, and six evaluation protocols, with rigorous statistical testing. We introduce spatiotemporal EEGFormer (ST-EEGFormer), a simple yet effective Vision Transformer (ViT)-based baseline, pre-trained solely with masked autoencoding (MAE) on over 8M EEG segments. Our results show that while fine-tuned foundation models perform well in data-rich, population-level settings, they often fail to significantly outperform compact neural networks or even classical non-neural decoders in data-scarce scenarios. Furthermore, linear probing remains consistently weak, and performance varies greatly across downstream tasks, with no clear scaling law observed among neural network decoders. These findings expose a substantial gap between pre-training and downstream fine-tuning, often diminishing the benefits of complex pre-training tasks. We further identify hidden architectural factors that affect performance and emphasize the need for transparent, statistically rigorous evaluation. Overall, this study calls for community-wide efforts to construct large-scale EEG datasets and for fair, reproducible benchmarks to advance EEG foundation models.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholar search engine). It analyzes a paper's tasks and contributions against retrieved prior work. While the system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. The results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs), so the system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases; human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper contributes a comprehensive benchmark of EEG foundation models evaluated across diverse datasets and six evaluation protocols, alongside ST-EEGFormer, a Vision Transformer baseline pre-trained with masked autoencoding on over 8M EEG segments. It resides in the 'Comprehensive Benchmarking Frameworks' leaf, which contains only three papers total, indicating a relatively sparse research direction within the broader taxonomy. This positioning reflects the emerging nature of systematic foundation model evaluation in EEG-based BCIs, where rigorous multi-protocol benchmarking remains uncommon despite growing interest in large-scale pre-training approaches.

The taxonomy reveals substantial activity in adjacent areas: the 'EEG Foundation Model Architectures and Pre-training Strategies' branch contains 16 papers across transformer-based, alternative, and hybrid approaches, while 'Application-Specific Adaptations' includes 13 papers targeting motor imagery, language decoding, and clinical tasks. The 'Comparative Analysis and Performance Assessment' leaf, a sibling category, houses four papers examining foundation model capabilities versus traditional methods. The paper bridges these domains by systematically evaluating architectural innovations from the foundation model branch against classical baselines, addressing the gap between pre-training research and practical deployment concerns highlighted in the comparative analysis cluster.

Among 30 candidates examined, Contribution A (comprehensive benchmark framework) shows one refutable candidate from 10 examined, suggesting some prior benchmarking efforts exist but remain limited in scope. Contribution B (ST-EEGFormer architecture) encountered no refutations across 10 candidates, indicating architectural novelty within the examined sample. Contribution C (empirical findings on foundation model limitations) identified four refutable candidates from 10 examined, reflecting existing discourse on classical baseline competitiveness. The search scale is modest, focusing on top-K semantic matches rather than exhaustive coverage, meaning these statistics characterize the immediate research neighborhood rather than the entire field.

Based on the limited search scope, the work appears to occupy a sparsely populated benchmarking niche while engaging with well-established debates about foundation model utility. The taxonomy structure confirms that systematic multi-protocol evaluation remains underexplored compared to architecture development, though the empirical findings align with emerging skepticism documented in comparative analysis literature. The analysis covers top-30 semantic matches and does not claim exhaustive prior work coverage.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 5

Research Landscape Overview

Core task: Benchmarking EEG foundation models for brain-computer interface decoding tasks. The field has evolved from traditional neural network architectures for EEG decoding toward large-scale foundation models that leverage pre-training on diverse datasets. The taxonomy reflects this shift through several main branches: one focused on EEG Foundation Model Architectures and Pre-training Strategies, where works like Neuro GPT[5] and Large Brain Model[14] explore transformer-based and generative approaches; another on Application-Specific Adaptations and Paradigm-Focused Models, addressing specialized BCI paradigms such as motor imagery or event-related potentials; a branch on Transfer Learning and Cross-Domain Adaptation, examining how models generalize across subjects and tasks; and a branch on Evaluation, Benchmarking, and Comparative Analysis, which systematically assesses model performance. The Traditional Neural Network Architectures for EEG Decoding and Survey and Review Literature branches provide historical context and synthesize emerging trends, with reviews like LLM EEG Survey[17] and Brain Decoding Survey[45] offering broad perspectives.

Recent efforts have concentrated on establishing rigorous evaluation protocols and understanding the practical value of foundation models in real-world BCI scenarios. Works such as Adabrain Bench[1] and Benchmarking ERP Analysis[49] provide structured frameworks for comparing models across multiple decoding tasks, while EEG Foundation Worth[0] sits squarely within this comprehensive benchmarking cluster. Unlike narrower evaluations that focus on single paradigms, EEG Foundation Worth[0] emphasizes systematic assessment of whether foundation models deliver meaningful improvements over task-specific baselines, echoing concerns raised in Adabrain Bench[1] about generalization and calibration efficiency. This contrasts with application-driven studies like Decoding Pain[3], which prioritize domain-specific performance.

The central question across these benchmarking efforts remains whether the computational overhead and data requirements of foundation models justify their adoption, particularly when traditional architectures like EEGNet[29] continue to perform competitively in constrained settings.

Claimed Contributions

Comprehensive benchmark of EEG foundation models with a six-protocol evaluation framework

The authors introduce a systematic evaluation framework spanning six protocols (Population, Per-Subject Self, Per-Subject Transfer, LOO Zero-Shot, LOO Fine-Tune, and LOO Drop) to assess foundation models against classical neural and non-neural decoders across seven classification and two regression tasks, training over 20,000 models in total under rigorous statistical testing. A sketch of the leave-one-out protocols is given below.

10 retrieved papers · Can Refute
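To make the protocol distinctions concrete, the sketch below shows how leave-one-out (LOO) subject splits could be generated. The function name, the synthetic subject IDs, and the comments contrasting Zero-Shot with Fine-Tune semantics are illustrative assumptions (the report does not define LOO Drop), not the authors' implementation.

```python
def loo_splits(subjects):
    """Yield (train_subjects, held_out_subject) pairs for leave-one-out
    (LOO) evaluation: every subject serves as the unseen test subject once."""
    for held_out in subjects:
        yield [s for s in subjects if s != held_out], held_out

subjects = [f"S{i:02d}" for i in range(1, 11)]  # synthetic subject IDs

for train_subjects, held_out in loo_splits(subjects):
    # LOO Zero-Shot: fit the decoder on train_subjects, then evaluate on
    # held_out with no adaptation at all.
    # LOO Fine-Tune: after fitting on train_subjects, adapt on a small
    # labeled calibration set from held_out, then evaluate on the rest.
    # (Population and Per-Subject protocols instead pool all subjects or
    # train and test entirely within one subject, respectively.)
    pass
```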
ST-EEGFormer: Vision Transformer-based foundation model with masked autoencoding

The authors propose ST-EEGFormer, a transparent baseline foundation model built on the Vision Transformer architecture and pre-trained using only masked autoencoding on raw EEG signals from more than 8 million segments, demonstrating that simple pre-training can be effective, contrary to prevailing views. A minimal masking sketch follows below.

10 retrieved papers · No refutations found
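The pre-training recipe named here is plain masked autoencoding: split each EEG segment into per-channel time patches, hide most of them, and train the network to reconstruct the hidden ones. The PyTorch sketch below illustrates that masking-and-loss step; the patch length, masking ratio, and the encoder/decoder interfaces are assumptions for illustration, not the ST-EEGFormer implementation.

```python
import torch

def mae_step(encoder, decoder, eeg, patch_len=50, mask_ratio=0.75):
    """One masked-autoencoding (MAE) step on a batch of raw EEG.
    eeg: (batch, channels, time); patches are per-channel time windows.
    `encoder` and `decoder` are hypothetical callables, not ST-EEGFormer."""
    b, c, t = eeg.shape
    patches = eeg.unfold(2, patch_len, patch_len)           # (b, c, n_t, patch_len)
    patches = patches.reshape(b, -1, patch_len)             # (b, n, patch_len)
    n = patches.shape[1]
    n_keep = int(n * (1 - mask_ratio))                      # visible patch count
    order = torch.rand(b, n, device=eeg.device).argsort(1)  # random patch order
    vis_idx = order[:, :n_keep]
    visible = torch.gather(
        patches, 1, vis_idx.unsqueeze(-1).expand(-1, -1, patch_len))
    latent = encoder(visible, vis_idx)     # encode visible patches + positions
    recon = decoder(latent, order)         # reconstruct all n patches
    mask_idx = order[:, n_keep:].unsqueeze(-1).expand(-1, -1, patch_len)
    target = torch.gather(patches, 1, mask_idx)
    pred = torch.gather(recon, 1, mask_idx)
    return ((pred - target) ** 2).mean()   # MSE only on masked patches
```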
Empirical findings on foundation model limitations and classical baseline competitiveness

The study reveals that foundation models do not universally outperform simpler approaches, particularly in low-data regimes; linear probing remains consistently weak, performance varies greatly across tasks, and no clear scaling law emerges among neural decoders, exposing gaps between pre-training and downstream fine-tuning (a linear-probing sketch follows below).

10 retrieved papers · Can Refute
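For context on the linear-probing finding: linear probing freezes the pre-trained encoder and trains only a linear classifier on its output features, so a weak probe suggests the pre-trained representations are not linearly separable for the downstream task. A minimal sketch of the setup follows; the encoder interface and hyperparameters are illustrative assumptions.

```python
import torch
import torch.nn as nn

def make_linear_probe(encoder, feat_dim, n_classes, lr=1e-3):
    """Freeze a pre-trained encoder and attach a trainable linear head.
    Only the head receives gradients, so downstream accuracy measures how
    linearly separable the frozen features already are."""
    for p in encoder.parameters():
        p.requires_grad = False       # freeze every encoder weight
    encoder.eval()                    # disable dropout/batch-norm updates
    head = nn.Linear(feat_dim, n_classes)
    optimizer = torch.optim.AdamW(head.parameters(), lr=lr)
    return head, optimizer

# Hypothetical training step, assuming encoder(x) -> (batch, feat_dim):
#   with torch.no_grad():
#       feats = encoder(x)
#   loss = nn.functional.cross_entropy(head(feats), y)
```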

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution A: Comprehensive benchmark of EEG foundation models with a six-protocol evaluation framework (described above).

Contribution B: ST-EEGFormer, a Vision Transformer-based foundation model with masked autoencoding (described above).

Contribution C: Empirical findings on foundation model limitations and classical baseline competitiveness (described above).