Abstract:

Historically, neuroscience has progressed by fragmenting into specialized domains, each focusing on isolated modalities, tasks, or brain regions. While fruitful, this approach hinders the development of a unified model of cognition. Here, we introduce TRIBE, the first deep neural network trained to predict brain responses to stimuli across multiple modalities, cortical areas, and individuals. By combining the pretrained representations of text, audio, and video foundation models and handling their time-evolving nature with a transformer, our model can precisely capture the spatial and temporal fMRI responses to videos, achieving first place in the Algonauts 2025 brain encoding competition by a significant margin over competitors. Ablations show that while unimodal models can reliably predict their corresponding cortical networks (e.g., visual or auditory networks), they are systematically outperformed by our multimodal model in high-level associative cortices. Currently applied to perception and comprehension, our approach paves the way towards an integrative model of representations in the human brain. Our code is available at \url{https://anonymous.4open.science/r/algonauts-2025-C63E}.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes an academic paper's task and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND ITS JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

TRIBE introduces a multimodal deep neural network for predicting whole-brain fMRI responses to video stimuli across multiple subjects. The paper occupies a unique position in the taxonomy: it is the sole member of the 'Benchmark Model: TRIBE' leaf, explicitly designated as the reference against which other approaches are compared. This placement reflects its role as a comprehensive integration point rather than a contribution to a crowded subfield. The taxonomy contains 50 papers across 25 leaf nodes, indicating a moderately populated research area with diverse specialized directions.

The taxonomy reveals that TRIBE sits at the intersection of multiple research branches. Its closest neighbors include 'Transformer-Based Multimodal Fusion' (3 papers), 'Foundation Model Adaptation and Prompt Learning' (3 papers), and 'Cross-Subject Alignment and Decoding' (3 papers). These adjacent leaves address components of TRIBE's approach—transformer architectures, pretrained model integration, and cross-subject generalization—but typically focus on one aspect rather than synthesizing all three. The taxonomy's scope notes clarify that TRIBE differs from unimodal models and static fusion approaches by explicitly modeling temporal dynamics and nonlinear multimodal integration.

Among 23 candidates examined across three contributions, no clearly refuting prior work was identified. The first contribution (multimodal, multi-subject, whole-brain prediction) examined 5 candidates with 0 refutations; the second (nonlinear multimodal integration via transformer) examined 10 candidates with 0 refutations; and the third (Algonauts 2025 competition performance) examined 8 candidates with 0 refutations. This suggests that within the limited search scope, no single prior work combines all three elements—multimodal integration, cross-subject generalization, and whole-brain coverage—in the manner TRIBE proposes. However, the search examined only top-K semantic matches, not an exhaustive literature review.

Based on the limited analysis of 23 candidates, TRIBE appears to occupy a synthesis position, integrating architectural elements and objectives addressed separately in prior work. The taxonomy structure shows that while individual components (transformers, foundation models, cross-subject methods) have precedents, their combination for whole-brain multimodal prediction represents a less explored configuration. The absence of refuting candidates among those examined suggests novelty in the integrated approach, though the restricted search scope means potentially relevant work outside the top-K matches remains unexamined.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 23
Refutable Papers: 0

Research Landscape Overview

Core task: Predicting whole-brain fMRI responses to multimodal video stimuli. This field has grown around the challenge of modeling how the brain processes rich, naturalistic audiovisual content such as movies or narratives. The taxonomy reveals a landscape organized into several major branches: one cluster focuses on multimodal feature extraction and integration architectures, exploring how to combine visual, auditory, and linguistic signals into unified representations; another emphasizes cross-modal and cross-subject generalization, addressing the variability in neural responses across individuals and stimulus modalities. Additional branches cover semantic and linguistic representation modeling, temporal dynamics and sequential prediction, audiovisual integration and multisensory processing, affective and social cognition processing, and specialized cognitive tasks. Computational methods and model interpretation form a methodological backbone, while datasets and experimental paradigms provide the empirical foundation. Individual differences and clinical applications extend the work toward personalized and translational goals, and meta-analyses evaluate the broader paradigm landscape.

Representative efforts include datasets like Narratives Dataset[12] and CineBrain Dataset[28], multimodal architectures such as Multimodal Seq2Seq Transformer[3] and Multimodal Recurrent Ensembles[4], and interpretive frameworks like Interpreting Video Transformers[23].

Several active lines of work highlight key trade-offs and open questions. One strand investigates how to scale and generalize models across subjects and modalities, as seen in Cross Subject Alignment[14] and Modality Agnostic Decoding[20], balancing the need for subject-specific tuning against the goal of universal encoding models. 
Another explores the interplay between low-level sensory features and high-level semantic or affective content, with studies like Emotional Arousal Networks[13] and Vision Language Social Brain[11] probing how emotion and social cognition emerge from multimodal integration. Within this landscape, TRIBE[0] serves as a benchmark model that synthesizes many of these themes, offering a comprehensive architecture for predicting whole-brain responses to complex video stimuli. Its emphasis on integrating temporal dynamics, multimodal features, and cross-subject generalization places it at the intersection of several branches, closely related to works like Multimodal Seq2Seq Transformer[3] and Comprehensive Neural Representations[8], yet distinguished by its focus on establishing a unified benchmark framework that can be compared against diverse specialized approaches.

Claimed Contributions

TRIBE: first deep neural network for multimodal, multi-subject, whole-brain fMRI prediction

The authors present TRIBE, a novel deep learning pipeline that predicts fMRI brain responses to videos by integrating text, audio, and video modalities in an end-to-end manner across the whole brain and multiple subjects, addressing limitations of linearity, subject-specificity, and unimodality in existing encoding approaches.

5 retrieved papers
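One common way to share a single encoder across individuals is a shared readout combined with lightweight per-subject parameters. The sketch below is a generic, dependency-free illustration of that idea, not a description of TRIBE's actual mechanism; the function name `predict_voxels` and the per-subject-bias design are hypothetical, and the real implementation lives in the linked repository.

```python
# Hypothetical subject-conditioned readout: a shared per-voxel linear map
# plus a learned per-subject offset. This is one generic way to share an
# encoder across individuals, NOT necessarily how TRIBE handles subjects.

def predict_voxels(features, weights, subject_bias):
    """Map one time point's fused features to per-voxel predictions.

    features:     fused feature vector for one fMRI time point (TR)
    weights:      shared per-voxel weight vectors (one list per voxel)
    subject_bias: per-voxel offsets specific to the current subject
    """
    return [
        sum(w_i * f_i for w_i, f_i in zip(w, features)) + b
        for w, b in zip(weights, subject_bias)
    ]
```

A richer variant would condition the whole encoder on a learned subject embedding rather than only shifting the output, but the shared-weights-plus-subject-parameters pattern is the core of how one model can serve multiple individuals.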
Nonlinear multimodal integration via transformer architecture

The model employs a transformer encoder to learn nonlinear mappings between pretrained multimodal representations (from text, audio, and video foundation models) and brain responses, rather than relying on the linear ridge regression used in prior work.

10 retrieved papers
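The transformer-based fusion described above can be illustrated with a minimal, dependency-free sketch: time-aligned features from the three modalities are concatenated per time step, then self-attention lets each time step mix information across the whole window. The function names (`fuse_modalities`, `self_attention`) are illustrative only, and the learned Q/K/V projections, positional encodings, multi-head structure, and voxel readout of the actual model are omitted; see the linked repository for the real implementation.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def fuse_modalities(text_feats, audio_feats, video_feats):
    """Concatenate time-aligned per-timestep features from three modalities."""
    return [t + a + v for t, a, v in zip(text_feats, audio_feats, video_feats)]

def self_attention(seq):
    """Single-head scaled dot-product self-attention over a feature sequence.

    Identity Q/K/V projections stand in for the learned weight matrices
    of a real transformer layer (omitted for brevity).
    """
    d = len(seq[0])
    out = []
    for query in seq:
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
            for key in seq
        ]
        weights = softmax(scores)
        out.append([
            sum(w * value[j] for w, value in zip(weights, seq))
            for j in range(d)
        ])
    return out
```

After `self_attention(fuse_modalities(...))`, each time step carries a nonlinear, context-dependent mixture of all three modalities across the window; a per-voxel readout (not shown) would then map these contextual features to fMRI responses, in contrast to fitting an independent ridge regression per voxel on static features.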
State-of-the-art performance in Algonauts 2025 competition

TRIBE achieved first place in the Algonauts 2025 brain encoding competition by a significant margin over competitors, demonstrating that the benefits of multimodality are largest in associative cortices and validating the importance of the multimodal, multi-subject, and nonlinear design.

8 retrieved papers

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Within the taxonomy built over the current top-K core-task papers, the original paper is assigned to a leaf with no direct siblings and no cousin branches under the same grandparent topic. In this retrieved landscape it appears structurally isolated, which is a partial signal of novelty, though one still constrained by search coverage and taxonomy granularity.

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

TRIBE: first deep neural network for multimodal, multi-subject, whole-brain fMRI prediction

The authors present TRIBE, a novel deep learning pipeline that predicts fMRI brain responses to videos by integrating text, audio, and video modalities in an end-to-end manner across the whole brain and multiple subjects, addressing limitations of linearity, subject-specificity, and unimodality in existing encoding approaches.

Contribution

Nonlinear multimodal integration via transformer architecture

The model employs a transformer encoder to learn nonlinear mappings between pretrained multimodal representations (from text, audio, and video foundation models) and brain responses, rather than relying on the linear ridge regression used in prior work.

Contribution

State-of-the-art performance in Algonauts 2025 competition

TRIBE achieved first place in the Algonauts 2025 brain encoding competition by a significant margin over competitors, demonstrating that the benefits of multimodality are largest in associative cortices and validating the importance of the multimodal, multi-subject, and nonlinear design.