CerebraGloss: Instruction-Tuning a Large Vision-Language Model for Fine-Grained Clinical EEG Interpretation

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: large vision-language model, instruction-tuning, EEG, clinical
Abstract:

Interpreting clinical electroencephalography (EEG) is a laborious, subjective process, and existing computational models are limited to narrow classification tasks rather than holistic interpretation. A key bottleneck for applying powerful Large Vision-Language Models (LVLMs) to this domain is the scarcity of datasets pairing EEG visualizations with fine-grained, expert-level annotations. We address this by introducing CerebraGloss, an instruction-tuned LVLM for nuanced EEG interpretation. We first present a novel, automated data generation pipeline, featuring a bespoke YOLO-based waveform detector, to programmatically create a large-scale corpus of EEG-text instruction data. Using this data, we develop CerebraGloss, the first model of its kind capable of unified, generative analysis, performing tasks from detailed waveform description to multi-turn, context-aware dialogue. To evaluate this new capability, we construct and release CerebraGloss-Bench, a comprehensive benchmark for open-ended EEG interpretation. CerebraGloss demonstrates strong performance, surpassing leading LVLMs, including proprietary models like GPT-5, on this benchmark and achieving a new state-of-the-art on the TUSZ seizure detection task. We will open-source our model, benchmark, and tools to foster progress in developing general-purpose neuro-intelligent systems.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes a paper's tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND ITS JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

CerebraGloss introduces an instruction-tuned large vision-language model for fine-grained clinical EEG interpretation, positioning itself within the 'Instruction-Tuned Clinical Interpretation Systems' leaf of the taxonomy. This leaf contains only two papers: the original work itself and one sibling (EEG-GPT). That sparsity is notable within the broader field of vision-language models for EEG analysis, which encompasses twenty-eight papers across multiple architectural and application-focused branches, and it suggests that this specific approach, combining instruction tuning with generative clinical interpretation, is relatively nascent.

The taxonomy reveals that neighboring leaves pursue complementary strategies: 'Multimodal Alignment and Pretraining Frameworks' (five papers) emphasizes contrastive learning without instruction tuning, while 'Hierarchical Vision-Language Integration' (two papers) explores multi-level feature alignment. Clinical application domains such as epilepsy analysis and neurocritical care monitoring focus on narrow diagnostic tasks rather than holistic interpretation. CerebraGloss diverges from these directions by targeting unified, generative analysis across multiple EEG interpretation tasks, bridging architectural innovation with broad clinical applicability. The taxonomy's scope notes clarify that instruction-tuned systems explicitly exclude pretraining-only methods, positioning this work at the intersection of model architecture and clinical deployment.

Among the twenty-five candidate papers examined, the automated data generation pipeline (Contribution A) shows overlap with two prior works, while the instruction-tuned model (Contribution B) and the benchmark (Contribution C) were each compared against ten candidates with no clear refutations. Because the search scope is limited, these statistics reflect top-K semantic matches rather than exhaustive coverage. Contribution A's refutable candidates suggest that YOLO-based waveform detection or automated EEG data generation may have precedents, whereas the instruction-tuned LVLM and the open-ended benchmark appear more distinctive within the examined literature. The sibling paper EEG-GPT likely represents the closest conceptual overlap, though a detailed comparison requires deeper analysis.

Based on the limited search of twenty-five candidates, CerebraGloss appears to occupy a sparsely populated research direction with only one direct sibling in the taxonomy. The instruction-tuned model and benchmark contributions show no clear prior work among examined candidates, suggesting potential novelty, though the data generation pipeline has identifiable precedents. This assessment reflects the scope of top-K semantic search and does not preclude additional relevant work outside the examined set.

Taxonomy

Core-task Taxonomy Papers: 28
Claimed Contributions: 3
Contribution Candidate Papers Compared: 25
Refutable Papers: 2

Research Landscape Overview

Core task: fine-grained clinical EEG interpretation using vision-language models. The field has evolved along several complementary directions. Vision-Language Model Architectures for EEG Analysis explores how to adapt multimodal frameworks, ranging from contrastive learning approaches like EEG-CLIP[22] to instruction-tuned systems such as EEG-GPT[9], for bridging neural signals and natural language. EEG Representation Learning and Feature Extraction focuses on encoding raw or preprocessed EEG into meaningful embeddings, often leveraging self-supervised or contrastive objectives (e.g., DistilCLIP-EEG[5]). Clinical Application Domains targets specific diagnostic tasks, including seizure prediction (Seizure Prediction Transformer[21]), sleep staging (EEG-VLM Sleep Stage[20]), and pathology detection (EEG Language Pathology[3]). Brain Decoding and Neural-to-Text Generation investigates how to translate neural activity into coherent textual descriptions, exemplified by Thought2Text[2] and Wave2Word[15]. Finally, Methodological Frameworks and Evaluation addresses benchmarking, few-shot learning (ADHD Few-Shot[14]), and quality assurance (AutocleanEEG ICVision[13]), ensuring robust and generalizable models.

Recent work has intensified around instruction-tuned clinical interpretation systems that combine large language models with EEG encoders to produce human-readable diagnostic narratives. CerebraGloss[0] sits squarely within this emerging cluster, emphasizing fine-grained clinical reasoning over raw signal patterns. It shares conceptual ground with EEG-GPT[9], which similarly employs instruction tuning for interpretable EEG analysis, yet CerebraGloss[0] appears to push further toward nuanced clinical glossaries and detailed report generation. In contrast, approaches like EEG-CLIP[22] prioritize contrastive alignment between EEG and text embeddings without explicit instruction following, while Thought2Text[2] focuses on decoding cognitive states into natural language rather than clinical diagnostics. A key open question remains how to balance model interpretability with diagnostic accuracy, especially when scaling to diverse pathologies and limited annotated data.
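The architectural contrast drawn above can be made concrete: EEG-CLIP-style methods optimize a symmetric InfoNCE objective over paired EEG/text embeddings, whereas instruction-tuned systems such as CerebraGloss train a generative decoder on instruction-response pairs. Below is a minimal NumPy sketch of the contrastive side only; it is a generic CLIP-style loss for illustration, not code from any of the cited papers:

```python
import numpy as np

def clip_style_loss(eeg_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired EEG/text embeddings.

    eeg_emb, txt_emb: (batch, dim) arrays; row i of each is a matched pair.
    """
    # L2-normalise so the dot product is a cosine similarity.
    eeg = eeg_emb / np.linalg.norm(eeg_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = eeg @ txt.T / temperature   # (batch, batch) similarity matrix
    labels = np.arange(len(logits))      # the diagonal holds matched pairs

    def xent(l):
        # Row-wise softmax cross-entropy against the diagonal targets.
        l = l - l.max(axis=1, keepdims=True)
        p = np.exp(l) / np.exp(l).sum(axis=1, keepdims=True)
        return -np.log(p[labels, labels]).mean()

    # Average the EEG-to-text and text-to-EEG directions.
    return (xent(logits) + xent(logits.T)) / 2
```

With perfectly aligned orthogonal pairs the loss is near zero; shuffling the text rows against the EEG rows drives it up, which is exactly the signal a contrastive pretraining framework exploits, and which instruction tuning replaces with next-token supervision.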

Claimed Contributions

Novel automated data generation pipeline with YOLO-based waveform detector

The authors develop an automated pipeline that generates structured clinical annotations from raw EEG signals. A key component is CerebraGloss-YOLO, a bespoke object detection model designed to localize and classify nine critical waveform types in multi-channel time-series data, enabling large-scale instruction dataset creation.

5 retrieved papers
Can Refute
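Neither the report nor the excerpt specifies how CerebraGloss-YOLO's detections become instruction-data annotations, but the post-detection step plausibly reduces to mapping boxes on a rendered multi-channel plot back to channels and time. The sketch below is hypothetical: the nine class names, the channel layout, and the helper are illustrative assumptions, not the authors' actual interface.

```python
# Nine illustrative waveform classes; the paper does not enumerate them,
# so these names are assumptions for the sketch.
WAVEFORM_CLASSES = [
    "spike", "sharp_wave", "spike_and_wave", "polyspike",
    "slow_wave", "alpha", "beta", "theta", "delta",
]

def detections_to_annotation(detections, window_sec=10.0,
                             channels=("Fp1", "Fp2", "C3", "C4")):
    """Convert detector boxes into text lines for instruction data.

    `detections` holds (class_id, confidence, x1, y1, x2, y2) tuples with
    coordinates normalised to [0, 1]; the x-axis is assumed to span
    `window_sec` seconds, with channel rows stacked evenly top to bottom.
    """
    lines = []
    for cls_id, conf, x1, y1, x2, y2 in detections:
        t0, t1 = x1 * window_sec, x2 * window_sec
        # Pick the channel row that the box centre falls into.
        row = min(int((y1 + y2) / 2 * len(channels)), len(channels) - 1)
        lines.append(
            f"{WAVEFORM_CLASSES[cls_id]} on {channels[row]} "
            f"from {t0:.1f}s to {t1:.1f}s (confidence {conf:.2f})"
        )
    return "; ".join(lines)

print(detections_to_annotation([(0, 0.91, 0.30, 0.05, 0.34, 0.22)]))
# -> spike on Fp1 from 3.0s to 3.4s (confidence 0.91)
```

Text of this shape can then be templated into instruction-response pairs, which is presumably how the detector "enables large-scale instruction dataset creation."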
CerebraGloss: instruction-tuned LVLM for generative EEG interpretation

The authors present CerebraGloss, the first large vision-language model capable of unified, generative EEG analysis. Through a two-stage training curriculum using their generated instruction data, the model performs tasks ranging from detailed waveform description to multi-turn dialogue, shifting from narrow classification to comprehensive interpretation.

10 retrieved papers
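The paper's instruction-data schema is not described in this report; as a minimal sketch, one multi-turn record might resemble a LLaVA-style conversation pairing an EEG image with generated turns. All field names and the templated dialogue below are hypothetical:

```python
import json

def make_instruction_sample(image_path, findings, sample_id):
    """Assemble one multi-turn instruction-tuning record (hypothetical schema).

    `findings` is the structured annotation produced upstream; the turns
    here are simple templates, whereas the actual pipeline presumably
    generates richer, clinically phrased text.
    """
    conversations = [
        {"from": "human",
         "value": "<image>\nDescribe the waveforms in this EEG segment."},
        {"from": "assistant",
         "value": "The segment shows " + "; ".join(findings) + "."},
        {"from": "human",
         "value": "Are any of these findings epileptiform?"},
        {"from": "assistant",
         "value": "Findings such as spikes and sharp waves are epileptiform; "
                  "rhythmic background activity alone is not."},
    ]
    return {"id": sample_id, "image": image_path, "conversations": conversations}

sample = make_instruction_sample(
    "segments/000123.png",
    ["a spike on Fp1 from 3.0s to 3.4s"],
    "eeg-000123",
)
print(json.dumps(sample, indent=2))
```

A two-stage curriculum over such records would typically first train on single-turn description pairs, then on the full multi-turn conversations, though the report does not detail the split.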
CerebraGloss-Bench: comprehensive benchmark for open-ended EEG interpretation

The authors introduce CerebraGloss-Bench, the first benchmark designed for open-ended clinical EEG interpretation and multi-class waveform object detection. It comprises 90 challenging segments with expert-validated annotations across four evaluation formats: free-text descriptions, complex multiple-choice questions, conversational QA pairs, and dense bounding box annotations.

10 retrieved papers
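For the dense bounding-box portion of such a benchmark, evaluation typically reduces to IoU matching between predicted and expert boxes. The following is a generic sketch of that metric, not CerebraGloss-Bench's actual protocol:

```python
def iou(box_a, box_b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

def detection_recall(preds, golds, thresh=0.5):
    """Fraction of gold boxes matched by some prediction at IoU >= thresh."""
    hits = sum(any(iou(p, g) >= thresh for p in preds) for g in golds)
    return hits / len(golds) if golds else 0.0
```

The free-text and conversational formats would instead be scored with text-similarity or judge-based metrics, and the multiple-choice format with plain accuracy; the report does not state which the authors chose.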

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Novel automated data generation pipeline with YOLO-based waveform detector

The authors develop an automated pipeline that generates structured clinical annotations from raw EEG signals. A key component is CerebraGloss-YOLO, a bespoke object detection model designed to localize and classify nine critical waveform types in multi-channel time-series data, enabling large-scale instruction dataset creation.

Contribution

CerebraGloss: instruction-tuned LVLM for generative EEG interpretation

The authors present CerebraGloss, the first large vision-language model capable of unified, generative EEG analysis. Through a two-stage training curriculum using their generated instruction data, the model performs tasks ranging from detailed waveform description to multi-turn dialogue, shifting from narrow classification to comprehensive interpretation.

Contribution

CerebraGloss-Bench: comprehensive benchmark for open-ended EEG interpretation

The authors introduce CerebraGloss-Bench, the first benchmark designed for open-ended clinical EEG interpretation and multi-class waveform object detection. It comprises 90 challenging segments with expert-validated annotations across four evaluation formats: free-text descriptions, complex multiple-choice questions, conversational QA pairs, and dense bounding box annotations.