CerebraGloss: Instruction-Tuning a Large Vision-Language Model for Fine-Grained Clinical EEG Interpretation

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: large vision-language model, instruction-tuning, EEG, clinical
Abstract:

Interpreting clinical electroencephalography (EEG) is a laborious, subjective process, and existing computational models are limited to narrow classification tasks rather than holistic interpretation. A key bottleneck for applying powerful Large Vision-Language Models (LVLMs) to this domain is the scarcity of datasets pairing EEG visualizations with fine-grained, expert-level annotations. We address this by introducing CerebraGloss, an instruction-tuned LVLM for nuanced EEG interpretation. We first present a novel, automated data generation pipeline, featuring a bespoke YOLO-based waveform detector, to programmatically create a large-scale corpus of EEG-text instruction data. Using this data, we develop CerebraGloss, the first model of its kind capable of unified, generative analysis, performing tasks from detailed waveform description to multi-turn, context-aware dialogue. To evaluate this new capability, we construct and release CerebraGloss-Bench, a comprehensive benchmark for open-ended EEG interpretation. CerebraGloss demonstrates strong performance, surpassing leading LVLMs, including proprietary models like GPT-5, on this benchmark and achieving a new state-of-the-art on the TUSZ seizure detection task. We will open-source our model, benchmark, and tools to foster progress in developing general-purpose neuro-intelligent systems.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes a paper's tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND ITS JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

CerebraGloss introduces an instruction-tuned large vision-language model for fine-grained clinical EEG interpretation, positioning itself within the 'Instruction-Tuned Clinical Interpretation Systems' leaf of the taxonomy. This leaf contains only two papers: the original work itself and one sibling (EEG-GPT). That sparsity is notable within the broader field of vision-language models for EEG analysis, which encompasses twenty-eight papers across multiple architectural and application-focused branches, and it suggests that this specific approach, combining instruction tuning with generative clinical interpretation, is relatively nascent.

The taxonomy reveals that neighboring leaves pursue complementary strategies: 'Multimodal Alignment and Pretraining Frameworks' (five papers) emphasizes contrastive learning without instruction tuning, while 'Hierarchical Vision-Language Integration' (two papers) explores multi-level feature alignment. Clinical application domains such as epilepsy analysis and neurocritical care monitoring focus on narrow diagnostic tasks rather than holistic interpretation. CerebraGloss diverges from these directions by targeting unified, generative analysis across multiple EEG interpretation tasks, bridging architectural innovation with broad clinical applicability. The taxonomy's scope notes clarify that instruction-tuned systems explicitly exclude pretraining-only methods, positioning this work at the intersection of model architecture and clinical deployment.

Among the twenty-five candidate papers examined, the automated data generation pipeline (Contribution A) shows overlap with two prior works, while the instruction-tuned model (Contribution B) and the benchmark (Contribution C) were each compared against ten candidates with no clear refutations. Because the search scope is limited, these statistics reflect top-K semantic matches rather than exhaustive coverage. Contribution A's refutable candidates suggest that YOLO-based waveform detection or automated EEG data generation may have precedents, whereas the instruction-tuned LVLM and the open-ended benchmark appear more distinctive within the examined literature. The sibling paper EEG-GPT likely represents the closest conceptual overlap, though a detailed comparison requires deeper analysis.

Based on the limited search of twenty-five candidates, CerebraGloss appears to occupy a sparsely populated research direction with only one direct sibling in the taxonomy. The instruction-tuned model and benchmark contributions show no clear prior work among examined candidates, suggesting potential novelty, though the data generation pipeline has identifiable precedents. This assessment reflects the scope of top-K semantic search and does not preclude additional relevant work outside the examined set.

Taxonomy

Core-task Taxonomy Papers: 28
Claimed Contributions: 3
Contribution Candidate Papers Compared: 25
Refutable Papers: 2

Research Landscape Overview

Core task: fine-grained clinical EEG interpretation using vision-language models. The field has evolved along several complementary directions. Vision-Language Model Architectures for EEG Analysis explores how to adapt multimodal frameworks, ranging from contrastive learning approaches like EEG-CLIP[22] to instruction-tuned systems such as EEG-GPT[9], for bridging neural signals and natural language. EEG Representation Learning and Feature Extraction focuses on encoding raw or preprocessed EEG into meaningful embeddings, often leveraging self-supervised or contrastive objectives (e.g., DistilCLIP-EEG[5]). Clinical Application Domains targets specific diagnostic tasks, including seizure prediction (Seizure Prediction Transformer[21]), sleep staging (EEG-VLM Sleep Stage[20]), and pathology detection (EEG Language Pathology[3]). Brain Decoding and Neural-to-Text Generation investigates how to translate neural activity into coherent textual descriptions, exemplified by Thought2Text[2] and Wave2Word[15]. Finally, Methodological Frameworks and Evaluation addresses benchmarking, few-shot learning (ADHD Few-Shot[14]), and quality assurance (AutocleanEEG ICVision[13]), ensuring robust and generalizable models.

Recent work has intensified around instruction-tuned clinical interpretation systems that combine large language models with EEG encoders to produce human-readable diagnostic narratives. CerebraGloss[0] sits squarely within this emerging cluster, emphasizing fine-grained clinical reasoning over raw signal patterns. It shares conceptual ground with EEG-GPT[9], which similarly employs instruction tuning for interpretable EEG analysis, yet CerebraGloss[0] appears to push further toward nuanced clinical glossaries and detailed report generation. In contrast, approaches like EEG-CLIP[22] prioritize contrastive alignment between EEG and text embeddings without explicit instruction following, while Thought2Text[2] focuses on decoding cognitive states into natural language rather than clinical diagnostics. A key open question remains how to balance model interpretability with diagnostic accuracy, especially when scaling to diverse pathologies and limited annotated data.
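The architectural contrast drawn above can be made concrete: EEG-CLIP-style methods optimize a symmetric InfoNCE objective over paired EEG/text embeddings, whereas instruction-tuned systems such as CerebraGloss train a generative decoder on instruction-response pairs. Below is a minimal NumPy sketch of the contrastive side only; it is a generic CLIP-style loss for illustration, not code from any of the cited papers:

```python
import numpy as np

def clip_style_loss(eeg_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired EEG/text embeddings.

    eeg_emb, txt_emb: (batch, dim) arrays; row i of each is a matched pair.
    """
    # L2-normalise so the dot product is a cosine similarity.
    eeg = eeg_emb / np.linalg.norm(eeg_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = eeg @ txt.T / temperature   # (batch, batch) similarity matrix
    labels = np.arange(len(logits))      # the diagonal holds matched pairs

    def xent(l):
        # Row-wise softmax cross-entropy against the diagonal targets.
        l = l - l.max(axis=1, keepdims=True)
        p = np.exp(l) / np.exp(l).sum(axis=1, keepdims=True)
        return -np.log(p[labels, labels]).mean()

    # Average the EEG-to-text and text-to-EEG directions.
    return (xent(logits) + xent(logits.T)) / 2
```

With perfectly aligned orthogonal pairs the loss is near zero; shuffling the text rows against the EEG rows drives it up, which is exactly the signal a contrastive pretraining framework exploits, and which instruction tuning replaces with next-token supervision.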

Claimed Contributions

Novel automated data generation pipeline with YOLO-based waveform detector

The authors develop an automated pipeline that generates structured clinical annotations from raw EEG signals. A key component is CerebraGloss-YOLO, a bespoke object detection model designed to localize and classify nine critical waveform types in multi-channel time-series data, enabling large-scale instruction dataset creation.

5 retrieved papers
Can Refute
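Neither the report nor the excerpt specifies how CerebraGloss-YOLO's detections become instruction-data annotations, but the post-detection step plausibly reduces to mapping boxes on a rendered multi-channel plot back to channels and time. The sketch below is hypothetical: the nine class names, the channel layout, and the helper are illustrative assumptions, not the authors' actual interface.

```python
# Nine illustrative waveform classes; the paper does not enumerate them,
# so these names are assumptions for the sketch.
WAVEFORM_CLASSES = [
    "spike", "sharp_wave", "spike_and_wave", "polyspike",
    "slow_wave", "alpha", "beta", "theta", "delta",
]

def detections_to_annotation(detections, window_sec=10.0,
                             channels=("Fp1", "Fp2", "C3", "C4")):
    """Convert detector boxes into text lines for instruction data.

    `detections` holds (class_id, confidence, x1, y1, x2, y2) tuples with
    coordinates normalised to [0, 1]; the x-axis is assumed to span
    `window_sec` seconds, with channel rows stacked evenly top to bottom.
    """
    lines = []
    for cls_id, conf, x1, y1, x2, y2 in detections:
        t0, t1 = x1 * window_sec, x2 * window_sec
        # Pick the channel row that the box centre falls into.
        row = min(int((y1 + y2) / 2 * len(channels)), len(channels) - 1)
        lines.append(
            f"{WAVEFORM_CLASSES[cls_id]} on {channels[row]} "
            f"from {t0:.1f}s to {t1:.1f}s (confidence {conf:.2f})"
        )
    return "; ".join(lines)

print(detections_to_annotation([(0, 0.91, 0.30, 0.05, 0.34, 0.22)]))
# -> spike on Fp1 from 3.0s to 3.4s (confidence 0.91)
```

Text of this shape can then be templated into instruction-response pairs, which is presumably how the detector "enables large-scale instruction dataset creation."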
CerebraGloss: instruction-tuned LVLM for generative EEG interpretation

The authors present CerebraGloss, the first large vision-language model capable of unified, generative EEG analysis. Through a two-stage training curriculum using their generated instruction data, the model performs tasks ranging from detailed waveform description to multi-turn dialogue, shifting from narrow classification to comprehensive interpretation.

10 retrieved papers
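The paper's instruction-data schema is not described in this report; as a minimal sketch, one multi-turn record might resemble a LLaVA-style conversation pairing an EEG image with generated turns. All field names and the templated dialogue below are hypothetical:

```python
import json

def make_instruction_sample(image_path, findings, sample_id):
    """Assemble one multi-turn instruction-tuning record (hypothetical schema).

    `findings` is the structured annotation produced upstream; the turns
    here are simple templates, whereas the actual pipeline presumably
    generates richer, clinically phrased text.
    """
    conversations = [
        {"from": "human",
         "value": "<image>\nDescribe the waveforms in this EEG segment."},
        {"from": "assistant",
         "value": "The segment shows " + "; ".join(findings) + "."},
        {"from": "human",
         "value": "Are any of these findings epileptiform?"},
        {"from": "assistant",
         "value": "Findings such as spikes and sharp waves are epileptiform; "
                  "rhythmic background activity alone is not."},
    ]
    return {"id": sample_id, "image": image_path, "conversations": conversations}

sample = make_instruction_sample(
    "segments/000123.png",
    ["a spike on Fp1 from 3.0s to 3.4s"],
    "eeg-000123",
)
print(json.dumps(sample, indent=2))
```

A two-stage curriculum over such records would typically first train on single-turn description pairs, then on the full multi-turn conversations, though the report does not detail the split.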
CerebraGloss-Bench: comprehensive benchmark for open-ended EEG interpretation

The authors introduce CerebraGloss-Bench, the first benchmark designed for open-ended clinical EEG interpretation and multi-class waveform object detection. It comprises 90 challenging segments with expert-validated annotations across four evaluation formats: free-text descriptions, complex multiple-choice questions, conversational QA pairs, and dense bounding box annotations.

10 retrieved papers
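For the dense bounding-box portion of such a benchmark, evaluation typically reduces to IoU matching between predicted and expert boxes. The following is a generic sketch of that metric, not CerebraGloss-Bench's actual protocol:

```python
def iou(box_a, box_b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

def detection_recall(preds, golds, thresh=0.5):
    """Fraction of gold boxes matched by some prediction at IoU >= thresh."""
    hits = sum(any(iou(p, g) >= thresh for p in preds) for g in golds)
    return hits / len(golds) if golds else 0.0
```

The free-text and conversational formats would instead be scored with text-similarity or judge-based metrics, and the multiple-choice format with plain accuracy; the report does not state which the authors chose.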

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Novel automated data generation pipeline with YOLO-based waveform detector

The authors develop an automated pipeline that generates structured clinical annotations from raw EEG signals. A key component is CerebraGloss-YOLO, a bespoke object detection model designed to localize and classify nine critical waveform types in multi-channel time-series data, enabling large-scale instruction dataset creation.

Contribution

CerebraGloss: instruction-tuned LVLM for generative EEG interpretation

The authors present CerebraGloss, the first large vision-language model capable of unified, generative EEG analysis. Through a two-stage training curriculum using their generated instruction data, the model performs tasks ranging from detailed waveform description to multi-turn dialogue, shifting from narrow classification to comprehensive interpretation.

Contribution

CerebraGloss-Bench: comprehensive benchmark for open-ended EEG interpretation

The authors introduce CerebraGloss-Bench, the first benchmark designed for open-ended clinical EEG interpretation and multi-class waveform object detection. It comprises 90 challenging segments with expert-validated annotations across four evaluation formats: free-text descriptions, complex multiple-choice questions, conversational QA pairs, and dense bounding box annotations.