TRIBE: TRImodal Brain Encoder for whole-brain fMRI response prediction
Overview
Overall Novelty Assessment
TRIBE introduces a multimodal deep neural network for predicting whole-brain fMRI responses to video stimuli across multiple subjects. The paper occupies a unique position in the taxonomy: it is the sole member of the 'Benchmark Model: TRIBE' leaf, explicitly designated as the reference against which other approaches are compared. This placement reflects its role as a comprehensive integration point rather than a contribution to a crowded subfield. The taxonomy contains 50 papers across 25 leaf nodes, indicating a moderately populated research area with diverse specialized directions.
The taxonomy reveals that TRIBE sits at the intersection of multiple research branches. Its closest neighbors include 'Transformer-Based Multimodal Fusion' (3 papers), 'Foundation Model Adaptation and Prompt Learning' (3 papers), and 'Cross-Subject Alignment and Decoding' (3 papers). These adjacent leaves address components of TRIBE's approach—transformer architectures, pretrained model integration, and cross-subject generalization—but typically focus on one aspect rather than synthesizing all three. The taxonomy's scope notes clarify that TRIBE differs from unimodal models and static fusion approaches by explicitly modeling temporal dynamics and nonlinear multimodal integration.
Among the 23 candidates examined across the three contributions, no clearly refuting prior work was identified. The first contribution (multimodal, multi-subject, whole-brain prediction) was checked against 5 candidates with 0 refutations; the second (nonlinear multimodal integration via a transformer) against 10 candidates with 0 refutations; and the third (Algonauts 2025 competition performance) against 8 candidates with 0 refutations. This suggests that, within the limited search scope, no single prior work combines all three elements (multimodal integration, cross-subject generalization, and whole-brain coverage) in the manner TRIBE proposes. However, the search covered only the top-K semantic matches and does not constitute an exhaustive literature review.
Based on the limited analysis of 23 candidates, TRIBE appears to occupy a synthesis position, integrating architectural elements and objectives addressed separately in prior work. The taxonomy structure shows that while individual components (transformers, foundation models, cross-subject methods) have precedents, their combination for whole-brain multimodal prediction represents a less explored configuration. The absence of refuting candidates among those examined suggests novelty in the integrated approach, though the restricted search scope means potentially relevant work outside the top-K matches remains unexamined.
Taxonomy
Research Landscape Overview
Claimed Contributions
The authors present TRIBE, a novel deep learning pipeline that predicts fMRI brain responses to videos by integrating text, audio, and video modalities in an end-to-end manner across the whole brain and multiple subjects, addressing limitations of linearity, subject-specificity, and unimodality in existing encoding approaches.
The model employs a transformer encoder to learn nonlinear mappings between pretrained multimodal representations (from text, audio, and video foundation models) and brain responses, rather than relying on the linear ridge regression used in prior work.
TRIBE achieved first place in the Algonauts 2025 brain encoding competition by a significant margin over competitors, demonstrating that the benefits of multimodality are largest in associative cortices and validating the importance of the multimodal, multi-subject, and nonlinear design.
Core Task Comparisons
Comparisons with papers in the same taxonomy category
Contribution Analysis
Detailed comparisons for each claimed contribution
TRIBE: first deep neural network for multimodal, multi-subject, whole-brain fMRI prediction
The authors present TRIBE, a novel deep learning pipeline that predicts fMRI brain responses to videos by integrating text, audio, and video modalities in an end-to-end manner across the whole brain and multiple subjects, addressing limitations of linearity, subject-specificity, and unimodality in existing encoding approaches.
[3] A Multimodal Seq2Seq Transformer for Predicting Brain Responses to Naturalistic Stimuli
[51] Rest2Visual: Predicting Visually Evoked fMRI from Resting-State Scans
[52] See Through Their Minds: Learning Transferable Neural Representation from Cross-Subject fMRI
[53] EEG-to-fMRI Neuroimaging Cross Modal Synthesis in Python
[54] Transmodal Analysis of Neural Signals
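To make the claimed pipeline concrete, the following is a minimal PyTorch sketch of a trimodal, multi-subject encoder: per-modality projections onto a shared width, a learned subject embedding so that one trunk serves all subjects, a transformer trunk for temporal integration, and a single whole-brain readout. All dimensions, the additive fusion, and the subject-embedding mechanism are illustrative assumptions, not TRIBE's published implementation.

```python
# Hypothetical sketch of a trimodal, multi-subject brain encoder.
# Feature widths, fusion scheme, and layer counts are illustrative only.
import torch
import torch.nn as nn

class TrimodalEncoder(nn.Module):
    def __init__(self, d_text=1024, d_audio=768, d_video=1024,
                 d_model=512, n_subjects=4, n_parcels=1000):
        super().__init__()
        # Project each pretrained modality stream to a shared width.
        self.proj_text = nn.Linear(d_text, d_model)
        self.proj_audio = nn.Linear(d_audio, d_model)
        self.proj_video = nn.Linear(d_video, d_model)
        # A learned subject embedding lets one trunk serve all subjects.
        self.subject_emb = nn.Embedding(n_subjects, d_model)
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=8,
                                           batch_first=True)
        self.trunk = nn.TransformerEncoder(layer, num_layers=4)
        # One readout over the whole brain (parcel-level responses).
        self.readout = nn.Linear(d_model, n_parcels)

    def forward(self, text, audio, video, subject_id):
        # text/audio/video: (batch, time, d_modality), time-aligned to fMRI bins.
        x = self.proj_text(text) + self.proj_audio(audio) + self.proj_video(video)
        x = x + self.subject_emb(subject_id)[:, None, :]
        h = self.trunk(x)            # nonlinear temporal/multimodal integration
        return self.readout(h)       # (batch, time, n_parcels)

model = TrimodalEncoder()
y = model(torch.randn(2, 100, 1024), torch.randn(2, 100, 768),
          torch.randn(2, 100, 1024), torch.tensor([0, 1]))
print(y.shape)  # torch.Size([2, 100, 1000])
```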
Nonlinear multimodal integration via transformer architecture
The model employs a transformer encoder to learn nonlinear mappings between pretrained multimodal representations (from text, audio, and video foundation models) and brain responses, rather than relying on the linear ridge regression used in prior work.
[3] A Multimodal Seq2Seq Transformer for Predicting Brain Responses to Naturalistic Stimuli
[55] Attention bottlenecks for multimodal fusion
[56] IQFormer: A novel transformer-based model with multi-modality fusion for automatic modulation recognition
[57] Multi-modal brain encoding models for multi-modal stimuli
[58] A hierarchical attention-based multimodal fusion framework for predicting the progression of Alzheimer's disease
[59] Attention-based convolutional neural network with multi-modal temporal information fusion for motor imagery EEG decoding
[60] Frequency-specific dual-attention based adversarial network for blood oxygen level-dependent time series prediction
[61] BrainSymphony: A Transformer-Driven Fusion of fMRI Time Series and Structural Connectivity
[62] Cinematic Mindscapes: High-quality Video Reconstruction from Brain Activity
[63] Transformer-based multimodal information fusion for facial expression analysis
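For contrast, the linear baseline this contribution moves away from is compact: a single ridge regression maps stimulus features directly to parcel responses, with no nonlinear interaction between modalities or across time. A hedged sketch, with illustrative shapes and regularization values:

```python
# Minimal sketch of the conventional linear encoding baseline.
# Shapes, data, and the alpha grid are illustrative assumptions.
import numpy as np
from sklearn.linear_model import RidgeCV

T, D, P = 2000, 512, 1000     # timepoints, feature dim, brain parcels
X = np.random.randn(T, D)     # pooled stimulus features (e.g., embeddings)
Y = np.random.randn(T, P)     # measured fMRI parcel responses

# One cross-validated ridge fit predicts all parcels at once.
ridge = RidgeCV(alphas=np.logspace(-1, 4, 6)).fit(X, Y)
Y_hat = ridge.predict(X)      # (T, P) linear predictions
```

The transformer sketch shown under the first contribution replaces this single linear map with stacked self-attention layers, which is what permits nonlinear interactions between modalities and across time.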
State-of-the-art performance in Algonauts 2025 competition
TRIBE achieved first place in the Algonauts 2025 brain encoding competition by a significant margin over competitors, demonstrating that the benefits of multimodality are largest in associative cortices and validating the importance of the multimodal, multi-subject, and nonlinear design.
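Brain-encoding competition scores are typically computed as the Pearson correlation between predicted and measured time series, evaluated independently per parcel and then averaged; whether Algonauts 2025 applies any additional normalization is not specified here, so the following sketch assumes only the standard metric.

```python
# Hedged sketch of the standard encoding metric: parcelwise Pearson r.
import numpy as np

def parcelwise_pearson(y_true, y_pred):
    """y_true, y_pred: (timepoints, parcels) arrays; returns r per parcel."""
    yt = y_true - y_true.mean(axis=0)
    yp = y_pred - y_pred.mean(axis=0)
    num = (yt * yp).sum(axis=0)
    den = np.sqrt((yt ** 2).sum(axis=0) * (yp ** 2).sum(axis=0))
    return num / den

r = parcelwise_pearson(np.random.randn(200, 1000), np.random.randn(200, 1000))
print(r.mean())  # overall encoding score (mean correlation across parcels)
```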