Abstract:

Historically, neuroscience has progressed by fragmenting into specialized domains, each focusing on isolated modalities, tasks, or brain regions. While fruitful, this approach hinders the development of a unified model of cognition. Here, we introduce TRIBE, the first deep neural network trained to predict brain responses to stimuli across multiple modalities, cortical areas, and individuals. By combining the pretrained representations of text, audio, and video foundation models and handling their time-evolving nature with a transformer, our model can precisely capture the spatial and temporal fMRI responses to videos, achieving first place in the Algonauts 2025 brain encoding competition by a significant margin over competitors. Ablations show that while unimodal models can reliably predict their corresponding cortical networks (e.g., visual or auditory networks), they are systematically outperformed by our multimodal model in high-level associative cortices. Currently applied to perception and comprehension, our approach paves the way towards an integrative model of representations in the human brain. Our code is available at \url{https://anonymous.4open.science/r/algonauts-2025-C63E}.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes an academic paper's task and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND ITS JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

TRIBE introduces a multimodal deep neural network for predicting whole-brain fMRI responses to video stimuli across multiple subjects. The paper occupies a unique position in the taxonomy: it is the sole member of the 'Benchmark Model: TRIBE' leaf, explicitly designated as the reference against which other approaches are compared. This placement reflects its role as a comprehensive integration point rather than a contribution to a crowded subfield. The taxonomy contains 50 papers across 25 leaf nodes, indicating a moderately populated research area with diverse specialized directions.

The taxonomy reveals that TRIBE sits at the intersection of multiple research branches. Its closest neighbors include 'Transformer-Based Multimodal Fusion' (3 papers), 'Foundation Model Adaptation and Prompt Learning' (3 papers), and 'Cross-Subject Alignment and Decoding' (3 papers). These adjacent leaves address components of TRIBE's approach—transformer architectures, pretrained model integration, and cross-subject generalization—but typically focus on one aspect rather than synthesizing all three. The taxonomy's scope notes clarify that TRIBE differs from unimodal models and static fusion approaches by explicitly modeling temporal dynamics and nonlinear multimodal integration.

Among 23 candidates examined across three contributions, no clearly refuting prior work was identified. The first contribution (multimodal, multi-subject, whole-brain prediction) examined 5 candidates with 0 refutations; the second (nonlinear multimodal integration via transformer) examined 10 candidates with 0 refutations; and the third (Algonauts 2025 competition performance) examined 8 candidates with 0 refutations. This suggests that within the limited search scope, no single prior work combines all three elements—multimodal integration, cross-subject generalization, and whole-brain coverage—in the manner TRIBE proposes. However, the search examined only top-K semantic matches, not an exhaustive literature review.

Based on the limited analysis of 23 candidates, TRIBE appears to occupy a synthesis position, integrating architectural elements and objectives addressed separately in prior work. The taxonomy structure shows that while individual components (transformers, foundation models, cross-subject methods) have precedents, their combination for whole-brain multimodal prediction represents a less explored configuration. The absence of refuting candidates among those examined suggests novelty in the integrated approach, though the restricted search scope means potentially relevant work outside the top-K matches remains unexamined.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 23
Refutable Papers: 0

Research Landscape Overview

Core task: Predicting whole-brain fMRI responses to multimodal video stimuli. This field has grown around the challenge of modeling how the brain processes rich, naturalistic audiovisual content such as movies or narratives. The taxonomy reveals a landscape organized into several major branches: one cluster focuses on multimodal feature extraction and integration architectures, exploring how to combine visual, auditory, and linguistic signals into unified representations; another emphasizes cross-modal and cross-subject generalization, addressing the variability in neural responses across individuals and stimulus modalities. Additional branches cover semantic and linguistic representation modeling, temporal dynamics and sequential prediction, audiovisual integration and multisensory processing, affective and social cognition processing, and specialized cognitive tasks. Computational methods and model interpretation form a methodological backbone, while datasets and experimental paradigms provide the empirical foundation. Individual differences and clinical applications extend the work toward personalized and translational goals, and meta-analyses evaluate the broader paradigm landscape.

Representative efforts include datasets like Narratives Dataset[12] and CineBrain Dataset[28], multimodal architectures such as Multimodal Seq2Seq Transformer[3] and Multimodal Recurrent Ensembles[4], and interpretive frameworks like Interpreting Video Transformers[23].

Several active lines of work highlight key trade-offs and open questions. One strand investigates how to scale and generalize models across subjects and modalities, as seen in Cross Subject Alignment[14] and Modality Agnostic Decoding[20], balancing the need for subject-specific tuning against the goal of universal encoding models. 
Another explores the interplay between low-level sensory features and high-level semantic or affective content, with studies like Emotional Arousal Networks[13] and Vision Language Social Brain[11] probing how emotion and social cognition emerge from multimodal integration. Within this landscape, TRIBE[0] serves as a benchmark model that synthesizes many of these themes, offering a comprehensive architecture for predicting whole-brain responses to complex video stimuli. Its emphasis on integrating temporal dynamics, multimodal features, and cross-subject generalization places it at the intersection of several branches, closely related to works like Multimodal Seq2Seq Transformer[3] and Comprehensive Neural Representations[8], yet distinguished by its focus on establishing a unified benchmark framework that can be compared against diverse specialized approaches.

Claimed Contributions

TRIBE: first deep neural network for multimodal, multi-subject, whole-brain fMRI prediction

The authors present TRIBE, a novel deep learning pipeline that predicts fMRI brain responses to videos by integrating text, audio, and video modalities in an end-to-end manner across the whole brain and multiple subjects, addressing limitations of linearity, subject-specificity, and unimodality in existing encoding approaches.

5 retrieved papers
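One common way to share a single encoder across individuals is a shared readout combined with lightweight per-subject parameters. The sketch below is a generic, dependency-free illustration of that idea, not a description of TRIBE's actual mechanism; the function name `predict_voxels` and the per-subject-bias design are hypothetical, and the real implementation lives in the linked repository.

```python
# Hypothetical subject-conditioned readout: a shared per-voxel linear map
# plus a learned per-subject offset. This is one generic way to share an
# encoder across individuals, NOT necessarily how TRIBE handles subjects.

def predict_voxels(features, weights, subject_bias):
    """Map one time point's fused features to per-voxel predictions.

    features:     fused feature vector for one fMRI time point (TR)
    weights:      shared per-voxel weight vectors (one list per voxel)
    subject_bias: per-voxel offsets specific to the current subject
    """
    return [
        sum(w_i * f_i for w_i, f_i in zip(w, features)) + b
        for w, b in zip(weights, subject_bias)
    ]
```

A richer variant would condition the whole encoder on a learned subject embedding rather than only shifting the output, but the shared-weights-plus-subject-parameters pattern is the core of how one model can serve multiple individuals.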
Nonlinear multimodal integration via transformer architecture

The model employs a transformer encoder to learn nonlinear mappings between pretrained multimodal representations (from text, audio, and video foundation models) and brain responses, rather than relying on the linear ridge regression used in prior work.

10 retrieved papers
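The transformer-based fusion described above can be illustrated with a minimal, dependency-free sketch: time-aligned features from the three modalities are concatenated per time step, then self-attention lets each time step mix information across the whole window. The function names (`fuse_modalities`, `self_attention`) are illustrative only, and the learned Q/K/V projections, positional encodings, multi-head structure, and voxel readout of the actual model are omitted; see the linked repository for the real implementation.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def fuse_modalities(text_feats, audio_feats, video_feats):
    """Concatenate time-aligned per-timestep features from three modalities."""
    return [t + a + v for t, a, v in zip(text_feats, audio_feats, video_feats)]

def self_attention(seq):
    """Single-head scaled dot-product self-attention over a feature sequence.

    Identity Q/K/V projections stand in for the learned weight matrices
    of a real transformer layer (omitted for brevity).
    """
    d = len(seq[0])
    out = []
    for query in seq:
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
            for key in seq
        ]
        weights = softmax(scores)
        out.append([
            sum(w * value[j] for w, value in zip(weights, seq))
            for j in range(d)
        ])
    return out
```

After `self_attention(fuse_modalities(...))`, each time step carries a nonlinear, context-dependent mixture of all three modalities across the window; a per-voxel readout (not shown) would then map these contextual features to fMRI responses, in contrast to fitting an independent ridge regression per voxel on static features.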
State-of-the-art performance in Algonauts 2025 competition

TRIBE achieved first place in the Algonauts 2025 brain encoding competition by a significant margin over competitors, demonstrating that the benefits of multimodality are largest in associative cortices and validating the importance of the multimodal, multi-subject, and nonlinear design.

8 retrieved papers

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Within the taxonomy built over the current top-K core-task papers, the original paper is assigned to a leaf with no direct siblings and no cousin branches under the same grandparent topic. In this retrieved landscape it appears structurally isolated, which is a partial signal of novelty, though one still constrained by search coverage and taxonomy granularity.

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

TRIBE: first deep neural network for multimodal, multi-subject, whole-brain fMRI prediction

The authors present TRIBE, a novel deep learning pipeline that predicts fMRI brain responses to videos by integrating text, audio, and video modalities in an end-to-end manner across the whole brain and multiple subjects, addressing limitations of linearity, subject-specificity, and unimodality in existing encoding approaches.

Contribution

Nonlinear multimodal integration via transformer architecture

The model employs a transformer encoder to learn nonlinear mappings between pretrained multimodal representations (from text, audio, and video foundation models) and brain responses, rather than relying on the linear ridge regression used in prior work.

Contribution

State-of-the-art performance in Algonauts 2025 competition

TRIBE achieved first place in the Algonauts 2025 brain encoding competition by a significant margin over competitors, demonstrating that the benefits of multimodality are largest in associative cortices and validating the importance of the multimodal, multi-subject, and nonlinear design.